Abstract

We discuss the possibilities and limitations of estimating the mean of a real-valued 
random variable from independent and identically distributed observations from a non- 
asymptotic point of view. In particular, we define estimators with a sub-Gaussian 
behavior even for certain heavy-tailed distributions. We also prove various impossibility 
results for mean estimators. 


1 Introduction 


Estimating the mean of a probability distribution $P$ on the real line based on a sample
$X_1^n = (X_1, \ldots, X_n)$ of $n$ independent and identically distributed random variables
is arguably the most basic problem of statistics. While the standard empirical mean

\[ \widehat{\mathrm{emp}}_n(X_1^n) = \frac{1}{n} \sum_{i=1}^n X_i \]

is the most natural choice, its finite-sample performance is far from optimal when the
distribution has a heavy tail.

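The central limit theorem guarantees that if the $X_i$ have a finite second moment, this
estimator has Gaussian tails, asymptotically, when $n \to \infty$. Indeed,

\[ \mathbb{P}\left( \left|\widehat{\mathrm{emp}}_n(X_1^n) - \mu_P\right| > \frac{\sigma_P\, \Phi^{-1}(1 - \delta/2)}{\sqrt{n}} \right) \to \delta, \tag{1} \]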

where $\mu_P$ and $\sigma_P^2 > 0$ are the mean and variance of $P$ (respectively) and $\Phi$ is
the cumulative distribution function of the standard normal distribution. This result is
essentially optimal: no estimator can have better-than-Gaussian tails for all distributions
in any “reasonable class” (cf. Remark 1 below).

This paper is concerned with a non-asymptotic version of the mean estimation problem.
We are interested in large, non-parametric classes of distributions, such as

\[ \mathcal{P}_2 := \{\text{all distributions over } \mathbb{R} \text{ with finite second moment}\} \tag{2} \]

\[ \mathcal{P}_2^{\sigma^2} := \{\text{all distributions } P \in \mathcal{P}_2 \text{ with variance } \sigma_P^2 = \sigma^2\} \quad (\sigma^2 > 0) \tag{3} \]

\[ \mathcal{P}_{\mathrm{krt} \le \kappa} := \{\text{all } P \in \mathcal{P}_2 \text{ with kurtosis} \le \kappa\} \quad (\kappa > 1), \tag{4} \]

as well as some other classes introduced in Section 3. Given such a class $\mathcal{P}$, we
would like to construct sub-Gaussian estimators. These should take an i.i.d. sample $X_1^n$
from some unknown $P \in \mathcal{P}$ and produce an estimate $\widehat{E}_n(X_1^n)$ of $\mu_P$
that satisfies

\[ \mathbb{P}\left( \left|\widehat{E}_n(X_1^n) - \mu_P\right| > L \sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta \quad \text{for all } \delta \in [\delta_{\min}, 1) \tag{5} \]

for some constant $L > 0$ that depends only on $\mathcal{P}$. One would like to keep
$\delta_{\min}$ as small as possible (say exponentially small in $n$).

Of course, when $n \to \infty$ with $\delta$ fixed, (5) is a weaker form of (1) since
$\Phi^{-1}(1 - \delta/2) \le \sqrt{2 \ln(2/\delta)}$. The point is that (5) should hold
non-asymptotically, for extremely small $\delta$, and uniformly over $P \in \mathcal{P}$, even
for classes $\mathcal{P}$ containing distributions with heavy tails. The empirical mean cannot
satisfy this property unless either $\mathcal{P}$ contains only sub-Gaussian distributions or
$\delta_{\min}$ is quite large (cf. Section 2.3.1), so designing sub-Gaussian estimators
with the kind of guarantee we look for is a non-trivial task.
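The comparison between the Gaussian quantile $\Phi^{-1}(1 - \delta/2)$ and the bound $\sqrt{2\ln(2/\delta)}$ can be checked numerically. The following is a minimal Python sketch, not part of the paper; the function names `gaussian_quantile_radius` and `sub_gaussian_radius` are ours and purely illustrative.

```python
from math import log, sqrt
from statistics import NormalDist


def gaussian_quantile_radius(delta: float) -> float:
    """Return Phi^{-1}(1 - delta/2), the two-sided standard normal quantile."""
    return NormalDist().inv_cdf(1.0 - delta / 2.0)


def sub_gaussian_radius(delta: float) -> float:
    """Return sqrt(2 ln(2/delta)), the closed-form upper bound on the quantile."""
    return sqrt(2.0 * log(2.0 / delta))


if __name__ == "__main__":
    # Print both sides of the inequality for a few confidence levels.
    for delta in (0.5, 0.1, 1e-2, 1e-4, 1e-8):
        print(delta, gaussian_quantile_radius(delta), sub_gaussian_radius(delta))
```

In this sketch the closed-form bound stays above the exact quantile for every tested $\delta$, which is the sense in which (5) is weaker than the asymptotic statement.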

In this paper we prove that, for most (but not all) classes $\mathcal{P} \subset \mathcal{P}_2$
we consider, there do exist estimators that achieve (5) for all large $n$, with
$\delta_{\min} \approx e^{-c_{\mathcal{P}} n}$ and a value of $L$ that does not depend on $\delta$
or $n$. In each case, $c_{\mathcal{P}} > 0$ is a constant that depends on the class $\mathcal{P}$
under consideration, and we also obtain nearly tight bounds on how $c_{\mathcal{P}}$ must depend
on $\mathcal{P}$. (In particular, $\delta_{\min}$ cannot be superexponentially small in $n$.) In
the specific case of bounded-kurtosis distributions (cf. (4) above), we achieve
$L \le \sqrt{2} + \varepsilon$ for the range of $\delta_{\min}$ given in Theorem 3.6. This
value of $L$ is nearly optimal by Remark 1 below.

Before this paper, it was known that (5) could be achieved for the whole class $\mathcal{P}_2$
of distributions with finite second moments, with a weaker notion of estimator that we
call $\delta$-dependent estimator, that is, an estimator $\widehat{E}_n = \widehat{E}_{n,\delta}$
that may also depend on the confidence parameter $\delta$. By contrast, the estimators that we
introduce here are called multiple-$\delta$ estimators: a single estimator works for the whole
range of $\delta \in [\delta_{\min}, 1)$. This distinction is made formal in Definition 1 below.
By way of comparison, we also prove some results on $\delta$-dependent estimators in the paper.
In particular, we show that the distinction is substantial. For instance, there are no
multiple-$\delta$ sub-Gaussian estimators for the full class $\mathcal{P}_2$ for any nontrivial
range of $\delta_{\min}$. Interestingly, multiple-$\delta$ estimators do exist (with
$\delta_{\min} \approx e^{-cn}$) for the class $\mathcal{P}_2^{\sigma^2}$ (corresponding to fixed
variance). In fact, this is true when the variance is “known up to constants,” but not otherwise.

Why finite variance?

In all examples mentioned above, we assume that all distributions $P \in \mathcal{P}$ have a
finite variance $\sigma_P^2$. In fact, our definition (5) implicitly requires that the variance
exists for all $P \in \mathcal{P}$. A natural question is if this condition can be weakened. For
example, for any $\alpha \in (0,1]$ and $M > 0$, one may consider the class
$\mathcal{P}_{1+\alpha}^M$ of all distributions whose $(1+\alpha)$-th central moment equals $M$
(i.e., $\mathbb{E}\left[|X - \mathbb{E}X|^{1+\alpha}\right] = M$ if $X$ is distributed according
to any $P \in \mathcal{P}_{1+\alpha}^M$). It is natural to ask whether there exist estimators of
the mean satisfying (5) with $\sigma_P$ replaced by some constant depending on $P$. In
Theorem 3.1 we prove that for every sample size $n$, $\delta < 1/2$, $\alpha \in (0,1]$, and for
any mean estimator $\widehat{E}_{n,\delta}$, there exists a distribution
$P \in \mathcal{P}_{1+\alpha}^M$ such that with probability at least $\delta$, the estimator is
at least $\left(M^{1/\alpha} \ln(2/\delta)/n\right)^{\alpha/(1+\alpha)}$ away from the target
$\mu_P$.

This result not only shows that one cannot expect sub-Gaussian confidence intervals
for classes that contain distributions of infinite variance but also that in such cases it is
impossible to have confidence intervals whose length scales as $n^{-1/2}$.

Weakly sub-Gaussian estimators 

Consider the class $\mathcal{P}_{\mathrm{Ber}}$ of all Bernoulli distributions, that is, the
class that contains all distributions $P$ of the form

\[ P(\{1\}) = 1 - P(\{0\}) = p, \quad p \in [0,1]. \]

Perhaps surprisingly, no multiple-$\delta$ estimator exists for this class of distributions,
even when $\delta_{\min}$ is a constant. (We do not explicitly prove this here but it is easy to
deduce it using the techniques of Sections 4.3 and 4.5.) On the other hand, by standard tail
bounds for the binomial distribution (e.g., by Hoeffding’s inequality), the standard empirical
mean satisfies, for all $\delta > 0$ and $P \in \mathcal{P}_{\mathrm{Ber}}$,

\[ \mathbb{P}\left( \left|\widehat{\mathrm{emp}}_n(X_1^n) - \mu_P\right| > \sqrt{\frac{\ln(2/\delta)}{2n}} \right) \le \delta. \]

Of course, this bound has a sub-Gaussian flavor as it resembles (5) except that the
confidence bounds do not scale by $\sigma_P (\log(1/\delta)/n)^{1/2}$ but rather by a
distribution-free constant times $(\log(1/\delta)/n)^{1/2}$.
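Hoeffding's inequality for Bernoulli samples reads $\mathbb{P}(|\widehat{\mathrm{emp}}_n - p| > t) \le 2e^{-2nt^2}$, so solving $2e^{-2nt^2} = \delta$ gives the distribution-free radius $t = \sqrt{\ln(2/\delta)/(2n)}$. The sketch below, which is our illustration and not taken from the paper, compares this radius with the exact binomial deviation probability; the helper names are hypothetical.

```python
from math import comb, log, sqrt


def hoeffding_radius(n: int, delta: float) -> float:
    """Distribution-free radius sqrt(ln(2/delta) / (2n)) from Hoeffding's inequality."""
    return sqrt(log(2.0 / delta) / (2.0 * n))


def bernoulli_deviation_probability(n: int, p: float, radius: float) -> float:
    """Exact P(|S/n - p| > radius) for S ~ Binomial(n, p)."""
    total = 0.0
    for s in range(n + 1):
        # Add the binomial mass of every outcome whose empirical mean is too far from p.
        if abs(s / n - p) > radius:
            total += comb(n, s) * p**s * (1.0 - p) ** (n - s)
    return total
```

For example, `bernoulli_deviation_probability(50, 0.1, hoeffding_radius(50, 0.05))` stays below `0.05`. The radius does not shrink when $p$ is close to 0 or 1, which is exactly the "weak" aspect discussed above.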

In general, we may call an estimator weakly sub-Gaussian with respect to the class
$\mathcal{P}$ if there exists a constant $\sigma_{\mathcal{P}}$ such that for all
$P \in \mathcal{P}$,

\[ \mathbb{P}\left( \left|\widehat{E}_n(X_1^n) - \mu_P\right| > L \sigma_{\mathcal{P}} \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta \]

for some constant $L > 0$. ($\delta$-dependent and multiple-$\delta$ versions of this definition
may be given in analogy to those of sub-Gaussian estimators.)

Note that if a class $\mathcal{P}$ is such that $\sup_{P \in \mathcal{P}} \sigma_P < \infty$,
then any sub-Gaussian estimator is weakly sub-Gaussian. However, for classes of distributions
without uniformly bounded variance, this is not necessarily the case and the two notions are
incomparable.

In this paper we focus on the notion of sub-Gaussian estimators and we do not pursue
further the characterization of the existence of weakly sub-Gaussian estimators.

1.1 Related work 

To our knowledge, the explicit distinction between $\delta$-dependent and multiple-$\delta$
estimators, and our construction of multiple-$\delta$ sub-Gaussian estimators for exponentially
small $\delta$, are all new. On the other hand, constructions of $\delta$-dependent estimators
are implicit in older work on stochastic optimization of Nemirovsky and Yudin [14] (see also
Levin [12] and Hsu [6]), sampling from large discrete structures by Jerrum, Valiant, and
Vazirani [8], and sketching algorithms, see Alon, Matias, and Szegedy [1]. Recently, there has
been a surge of interest in sub-Gaussian estimators, their generalizations to multivariate
settings, and their applications in a variety of statistical learning problems where
heavy-tailed distributions may be present, see, for example, Catoni [5], Hsu and Sabato [7],
Brownlees, Joly, and Lugosi [3], Lerasle and Oliveira, Minsker, Audibert and Catoni [2],
Bubeck, Cesa-Bianchi, and Lugosi [4]. Most of these papers use $\delta$-dependent sub-Gaussian
estimators. Catoni’s paper [5] is close in spirit to ours, as it focuses on sub-Gaussian mean
estimation as a fundamental problem. That paper presents $\delta$-dependent sub-Gaussian
estimators with nearly optimal $L = \sqrt{2} + o(1)$ for a wide range of $\delta$ and the
classes $\mathcal{P}_2^{\sigma^2}$ and $\mathcal{P}_{\mathrm{krt} \le \kappa}$ defined in (3)
and (4). The $\delta$-dependent sub-Gaussian estimator introduced by [3] may be converted into
a multiple-$\delta$ estimator with subexponential (instead of sub-Gaussian) tails for
$\mathcal{P}_2$ by choosing the single parameter of the estimator appropriately. Loosely
speaking, this corresponds to squaring the term $\ln(1/\delta)$ in (5). Catoni also obtains
multiple-$\delta$ estimators for $\mathcal{P}_2$ with subexponential tails. These ideas are
strongly related to Audibert and Catoni’s paper on robust least-squares linear regression [2].


1.2 Main proof ideas 

The negative results we prove in this paper are minimax lower bounds for simple families
of distributions such as scaled Bernoulli distributions (Theorem 3.1), Laplace distributions
with fixed scale parameter for $\delta$-dependent estimators (Theorem 4.3), and the Poisson
family for multiple-$\delta$ estimators (Theorem 4.4). The main point about the latter choices
is that it is easy to compare the probabilities of events when one changes the values of the
parameter. Interestingly, Catoni’s lower bounds in [5] also follow from a one dimensional
family (in that case, Gaussians with fixed variance $\sigma^2 > 0$).

Our constructions of estimators use two main ideas. The first one is that, while one
cannot turn $\delta$-dependent into multiple-$\delta$ estimators, one can build
multiple-$\delta$ estimators from the slightly stronger concept of sub-Gaussian confidence
intervals. That is, if for each $\delta > 0$ one can find an empirical confidence interval for
$\mu_P$ with “sub-Gaussian length”, one may combine these intervals to produce a single
multiple-$\delta$ estimator. This general construction is presented in Section 4.2 and is
related at a high level to Lepskii’s adaptation
method BM- 

Although general, this method of confidence intervals loses constant factors. Our second
idea for building estimators, which is specific to the bounded kurtosis case (see Theorem 3.6
below), is to use a data-driven truncation mechanism to make the empirical mean better
behaved. By using preliminary estimators of the mean and variance, we truncate the
random variables in the sample and obtain a Bennett-type concentration inequality with
sharp constant $L = \sqrt{2} + o(1)$. A crucial point in this analysis is to show that our
truncation mechanism is fairly insensitive to the preliminary estimators being used.

1.3 Organization. 

The remainder of the paper is organized as follows. Section 2 fixes notation, formally defines
our problem, and discusses previous work in light of our definition. Section 3 states our
main results. Several general methods that we use throughout the paper are collected in
Section 4. Proofs of the main results are given in Sections 5 to 7. Section 8 discusses several
open problems.

2 Preliminaries 

2.1 Notation 

We write N = {0,1, 2,... }. For a positive integer n, denote [n] = {1,... , n}. |A| denotes 
the cardinality of the finite set A. 

We treat $\mathbb{R}$ and $\mathbb{R}^n$ as measurable spaces with the respective Borel
$\sigma$-fields kept implicit. Elements of $\mathbb{R}^n$ are denoted by
$x_1^n = (x_1, \ldots, x_n)$ with $x_1, \ldots, x_n \in \mathbb{R}$.

Probability distributions over $\mathbb{R}$ are denoted $P$. Given a (suitably measurable)
function $f = f(X, \theta)$ of a real-valued random variable $X$ distributed according to $P$
and some other parameter $\theta$, we let

\[ P f = P f(X, \theta) = \int_{\mathbb{R}} f(x, \theta)\, P(dx) \]

denote the integral of $f$ with respect to $X$. Assuming $P X^2 < \infty$, we use the symbols
$\mu_P = P X$ and $\sigma_P^2 = P X^2 - \mu_P^2$ for the mean and variance of $P$.

$Z =_d P$ means that $Z$ is a random object (taking values in some measurable space)
and $P$ is the distribution of this object. $X_1^n =_d P^{\otimes n}$ means that
$X_1^n = (X_1, \ldots, X_n)$ is a random vector in $\mathbb{R}^n$ with the product distribution
corresponding to $P$. Moreover, given such a random vector $X_1^n$ and a nonempty set
$B \subset [n]$, $\widehat{P}_B$ is the empirical measure of $X_i$, $i \in B$:

\[ \widehat{P}_B = \frac{1}{|B|} \sum_{i \in B} \delta_{X_i}. \]

We write $\widehat{P}_n$ instead of $\widehat{P}_{[n]}$ for simplicity.

2.2 The sub-Gaussian mean estimation problem 

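In this section, we begin a more formal discussion of the main problem in this paper. We
start with the definition of a sub-Gaussian estimator of the mean.

Definition 1 Let $n$ be a positive integer, $L > 0$, $\delta_{\min} \in (0,1)$. Let $\mathcal{P}$
be a family of probability distributions over $\mathbb{R}$ with finite second moments.

1. $\delta$-dependent sub-Gaussian estimation: a $\delta$-dependent $L$-sub-Gaussian estimator
for $(\mathcal{P}, n, \delta_{\min})$ is a measurable mapping
$\widehat{E}_n : \mathbb{R}^n \times [\delta_{\min}, 1) \to \mathbb{R}$ such that if
$P \in \mathcal{P}$, $\delta \in [\delta_{\min}, 1)$, and $X_1^n = (X_1, \ldots, X_n)$ is a
sample of i.i.d. random variables distributed as $P$, then

\[ \mathbb{P}\left( \left|\widehat{E}_n(X_1^n, \delta) - \mu_P\right| > L \sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta. \tag{6} \]

We also write $\widehat{E}_{n,\delta}(\cdot)$ for $\widehat{E}_n(\cdot, \delta)$.

2. multiple-$\delta$ sub-Gaussian estimation: a multiple-$\delta$ $L$-sub-Gaussian estimator
for $(\mathcal{P}, n, \delta_{\min})$ is a measurable mapping
$\widehat{E}_n : \mathbb{R}^n \to \mathbb{R}$ such that, for each
$\delta \in [\delta_{\min}, 1)$, $P \in \mathcal{P}$ and i.i.d. sample
$X_1^n = (X_1, \ldots, X_n)$ distributed as $P$,

\[ \mathbb{P}\left( \left|\widehat{E}_n(X_1^n) - \mu_P\right| > L \sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta. \tag{7} \]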

It transpires from these definitions that multiple-$\delta$ estimators are preferable whenever
they are available, because they combine good typical behavior with nearly optimal bounds
under extremely rare events. By contrast, the need to commit to a $\delta$ in advance means
that $\delta$-dependent estimators may be too pessimistic when a small $\delta$ is desired. The
main problem addressed in this paper is the following:

Given a family $\mathcal{P}$ (or more generally a sequence of families $\mathcal{P}_n$), find
the smallest possible sequence $\delta_{\min} = \delta_{\min,n}$ such that multiple-$\delta$
$L$-sub-Gaussian estimators for $(\mathcal{P}, n, \delta_{\min,n})$ (resp.
$(\mathcal{P}_n, n, \delta_{\min,n})$) exist for all large $n$, and with a constant $L$ that
does not depend on $n$.

Remark 1 (Optimality of sub-Gaussian estimators.) Call a class $\mathcal{P}$ “reasonable” when
it contains all Gaussian distributions with a given variance $\sigma^2 > 0$. Catoni
[5, Proposition 6.1] shows that, if $\delta \in (0,1)$, $\mathcal{P}$ is reasonable and some
estimator $\widehat{E}_{n,\delta}$ achieves

\[ \mathbb{P}\left( \widehat{E}_{n,\delta}(X_1^n) - \mu_P > \frac{r \sigma_P}{\sqrt{n}} \right) \le \delta \quad \text{whenever } P \in \mathcal{P}, \]

then $r \ge \Phi^{-1}(1 - \delta)$. The same result holds for the lower tail. Since
$\Phi^{-1}(1 - \delta) \sim \sqrt{2 \ln(1/\delta)}$ for small $\delta$, this means that, for any
reasonable class $\mathcal{P}$, no constant $L < \sqrt{2}$ is achievable for small
$\delta_{\min}$, and no better dependence on $n$ or $\delta$ is possible. In particular,
sub-Gaussian estimators are optimal up to constants, and estimators with
$L \le \sqrt{2} + o(1)$ are “nearly optimal.”


2.3 Known examples from previous work 

In what follows we present some known estimators of the mean and discuss their sub- 
Gaussian properties (or lack thereof). 

2.3.1 Empirical mean as a sub-Gaussian estimator 

For large $n$, $\sigma^2 > 0$ fixed and $\delta_{\min} \to 0$, the empirical mean

\[ \widehat{\mathrm{emp}}_n(X_1^n) = \frac{1}{n} \sum_{i=1}^n X_i \]

is not an $L$-sub-Gaussian estimator for the class $\mathcal{P}_2^{\sigma^2}$ of all
distributions with variance $\sigma^2$. This is a consequence of [5, Proposition 6.2], which
shows that the deviation bound obtained from Chebyshev’s inequality is essentially sharp.

Things change under slightly stronger assumptions. For example, a nonuniform version 
of the Berry-Esseen theorem im Theorem 14, p. 125]) implies that, for large n, is a 

multiple-5 (\/2 + e)-sub-Gaussian estimator for 5 mjn «). where 

Ps.r, = {P G P2 : P|^ - 

for some rj > 1) and 5min,n n“^/^(logSimilar results (with worse constants) 
hold for the class Pkrt<K (cf. dH)) when 5min I/n and k is bounded 0 Proposition 
5.1]. Catoni [5l Proposition 6.3] shows that the sub-Gaussian property breaks down when 
5min = o (1/n). Exponentially small 5 min can be achieved under much stronger assumptions. 
For example, Bennett’s inequality implies that effip^ is (\/2 + e)-sub-Gaussian for the triple 
{Poo,7j,n, drain), with 5min = and 

Voo,-n ■= {P £ ^2 : - hp\ <6^? a-s-} • 
2.3.2 Median of means


Quite remarkably, as it has been known for some time, one can do much better than the
empirical mean in the $\delta$-dependent setting. The so-called median of means construction
gives $L$-sub-Gaussian estimators (with $L$ some constant) for any triple
$(\mathcal{P}_2, n, e^{1 - n/2})$ where $n \ge 6$. The basic idea is to partition the data into
disjoint blocks, calculate the empirical mean within each block, and finally take the median
of them. This construction with a basic performance bound is reviewed in Section 4.1, as it
provides a building block and an inspiration for the new constructions in this paper. We
emphasize that, as pointed out in the introduction, variants of this result have been known
for a long time, see Nemirovsky and Yudin [14], Levin [12], Jerrum, Valiant, and Vazirani [8],
and Alon, Matias, and Szegedy [1]. Note that this estimator has good performance even for
distributions with infinite variance (see the remark following Theorem 3.1 below).

2.3.3 Catoni’s estimators 

The constant $L$ obtained by the median-of-means estimator is larger than the optimal value
$\sqrt{2}$ (see Remark 1). Catoni [5] designs $\delta$-dependent sub-Gaussian estimators with
nearly optimal $L = \sqrt{2} + o(1)$ for the classes $\mathcal{P}_2^{\sigma^2}$ (known variance)
and $\mathcal{P}_{\mathrm{krt} \le \kappa}$ (bounded kurtosis). A variant of Catoni’s estimator
is a multiple-$\delta$ estimator, however with subexponential instead of sub-Gaussian tails
(i.e., the $\sqrt{\ln(1/\delta)}$ term in (7) appears squared). Both estimators work for
exponentially small $\delta$, although the constant in the exponent for
$\mathcal{P}_{\mathrm{krt} \le \kappa}$ depends on $\kappa$.

3 Main results 

Here we present the main results of the paper. Proofs are deferred to Sections 5 to 7.

3.1 On the non-existence of sub-Gaussian mean estimators

Recall that for any $\alpha, M > 0$, $\mathcal{P}_{1+\alpha}^M$ denotes the class of all
distributions on $\mathbb{R}$ whose $(1+\alpha)$-th central moment equals $M$ (i.e.,
$\mathbb{E}\left[|X - \mathbb{E}X|^{1+\alpha}\right] = M$). We start by pointing out that
when $\alpha < 1$, no sub-Gaussian estimators exist (even if one allows $\delta$-dependent
estimators).

Theorem 3.1 Let $n > 5$ be a positive integer, $M > 0$, $\alpha \in (0,1]$, and
$\delta \in (2e^{-n/4}, 1/2)$. Then for any mean estimator $\widehat{E}_n$,

\[ \sup_{P \in \mathcal{P}_{1+\alpha}^M} \mathbb{P}\left( \left|\widehat{E}_n(X_1^n, \delta) - \mu_P\right| > \left( \frac{M^{1/\alpha} \ln(2/\delta)}{n} \right)^{\alpha/(1+\alpha)} \right) \ge \delta. \]

The proof is given in Section 4.3. The bound of the theorem is essentially tight. It is
shown in Bubeck, Cesa-Bianchi, and Lugosi [4] that for each $M > 0$, $\alpha \in (0,1]$, and
$\delta$, there exists an estimator $\widehat{E}_n(X_1^n, \delta)$ such that

sup P \En{X^,5) - /ip| > 8^- \ <6 . 

P^T^^+c \ \ "^ ) ) 

The estimator $\widehat{E}_n(X_1^n, \delta)$ satisfying this bound is the median-of-means
estimator with appropriately chosen parameters.

It is an interesting question whether multiple-$\delta$ estimators exist with similar
performance. Since our primary goal in this paper is the study of sub-Gaussian estimators, we
do not pursue the case of infinite variance further.

3.2 The value of knowing the variance 

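Given $0 < \sigma_1 \le \sigma_2 < \infty$, define the class

\[ \mathcal{P}_2^{[\sigma_1^2, \sigma_2^2]} = \left\{ P \in \mathcal{P}_2 : \sigma_1^2 \le \sigma_P^2 \le \sigma_2^2 \right\} \]

of distributions with variance between $\sigma_1^2$ and $\sigma_2^2$. This class interpolates
between the classes of distributions with fixed variance $\mathcal{P}_2^{\sigma^2}$ and with
completely unknown variance $\mathcal{P}_2$. The next theorem is proven in Section 5.

Theorem 3.2 Let $0 < \sigma_1 \le \sigma_2 < \infty$ and define $R = \sigma_2 / \sigma_1$.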

1. Letting L^^'i = (4e\/2 -|- 41n2)i? and for every n > 6 there exists a 

multiple-5 L^^^-sub-Gaussian estimator for {V 2 

2. For any L > \/2, there exist > 0 and <5^;^ > 0 such that, when R > , there is 

no multiple-5 L-sub-Gaussian estimator for for any n. 

3. For any value of R > 1 and L > \f2, if we let 5 ^},, = ^ there is no 5- 

dependent L-sub-Gaussian estimator for for any n. 

It is instructive to consider this result when $n$ grows and $R = R_n$ may change with $n$.
The theorem says that, when $\sup_n R_n < \infty$, there are multiple-$\delta$
$L$-sub-Gaussian estimators for all large $n$, with exponentially small $\delta_{\min}$ and a
constant $L$. On the other hand, if $R_n \to \infty$, for any constant $L$ and all large $n$,
no multiple-$\delta$ $L$-sub-Gaussian estimators exist for any sequence
$\delta = \delta_{\min,n} \to 0$. Finally, the third item says that even when $R_n = 1$,
$\delta$-dependent estimators are limited to $\delta_{\min} = e^{-O(n)}$, so the
median-of-means estimator is optimal in this sense.

3.3 Regularity, symmetry and higher moments

Theorem 3.2 shows that finite, but completely unknown variance is too weak an assumption
for multiple-$\delta$ sub-Gaussian estimation. The following shows that what we call regularity
conditions can substitute for knowledge of the variance.

Definition 2 For $P \in \mathcal{P}_2$ and $j \in \mathbb{N} \setminus \{0\}$, let
$X_1, \ldots, X_j$ be i.i.d. random variables with distribution $P$. Define


<7>pj and p+{P,j) 

Given $k \in \mathbb{N}$, we define the $k$-regular class as follows:

\[ \mathcal{P}_{2,k\text{-reg}} = \left\{ P \in \mathcal{P}_2 : \forall j \ge k,\ \min(p_+(P,j), p_-(P,j)) \ge 1/3 \right\}. \]

Note that this family of distributions is increasing in $k$. Also note that
$\bigcup_{k \in \mathbb{N}} \mathcal{P}_{2,k\text{-reg}} = \mathcal{P}_2$, because the central
limit theorem implies $p_+(P,j) \to 1/2$ and $p_-(P,j) \to 1/2$. Here are two important
examples of large families of distributions in this class:

Example 3.1 We say that a distribution $P \in \mathcal{P}_2$ is symmetric around the mean if,
given $X =_d P$, $2\mu_P - X =_d P$ as well. Clearly, if $P$ has this property,
$p_+(P,j) = p_-(P,j) = 1/2$ for all $j$ and thus $P \in \mathcal{P}_{2,1\text{-reg}}$. In other
words, $\mathcal{P}_{2,\mathrm{sym}} \subset \mathcal{P}_{2,1\text{-reg}}$, where
$\mathcal{P}_{2,\mathrm{sym}}$ is the class of all $P \in \mathcal{P}_2$ that are symmetric
around the mean.

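Example 3.2 Given $\eta \ge 1$ and $\alpha \in (2,3]$, set

\[ \mathcal{P}_{\alpha,\eta} = \left\{ P \in \mathcal{P}_2 : P|X - \mu_P|^{\alpha} \le (\eta \sigma_P)^{\alpha} \right\}. \tag{8} \]

We show in Lemma 6.3 that, for $P$ in this family, $\min(p_+(P,j), p_-(P,j)) \ge 1/3$ once
$j \ge (C_\alpha \eta)^{2\alpha/(\alpha-2)}$ for a constant $C_\alpha$ depending only on
$\alpha$. We deduce

\[ \mathcal{P}_{\alpha,\eta} \subset \mathcal{P}_{2,k\text{-reg}} \quad \text{if } k \ge (C_\alpha \eta)^{2\alpha/(\alpha-2)}. \]

Our main result about $k$-regular classes states that sub-Gaussian multiple-$\delta$
estimators exist for $\mathcal{P}_{2,k\text{-reg}}$ in the sense of the following theorem,
proven in Section 6.1.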

Theorem 3.3 Let n, k be positive integers with n > (3-|-ln4) 124fc. Set 5.^\.a,n,k = 

and L* = 4y^2 (1 + 21n2) (1 -|- 621n(3)) . Then there exists a L^,-sub-Gaussian multiple- 

5 estimator for ('P2,fc-reg, n, (5min,n,fc)- 
We also show that the range of $\delta_{\min}$ in this result is optimal. This follows
directly from stronger results that we prove for Examples 3.1 and 3.2. In other words, the
general family of estimators designed for $k$-regular classes has nearly optimal range of
$\delta$ for these two smaller classes. The next result, for symmetric distributions, is
proven in Section 6.2.

Theorem 3.4 Consider the class $\mathcal{P}_{2,\mathrm{sym}}$ defined in Example 3.1. Then

1. the estimator obtained in Theorem 3.3 for $k = 1$ is an $L_*$-sub-Gaussian
multiple-$\delta$ estimator for $(\mathcal{P}_{2,\mathrm{sym}}, n, \delta_{\min,n,1})$ when
$n \ge (3 + \ln 2)\, 124$;

2. on the other hand, for any L > y/2, no 6-dependent L-sub-Gaussian estimator can 
exist for (P 2 ,sym, n, 

We also have an analogous result for the class $\mathcal{P}_{\alpha,\eta}$. The proof may be
found in Section 6.3.

Theorem 3.5 Fix a € (2,3] and assume rj > 3^/^ 2^/®. Consider the class 'Pa,ri defined 
in Example \3.^ Then there exists some Cq, > 0 depending only on a such that if ka = 

1. the estimator obtained in Theorem 1,9.,11 for k = ka is a L^,-sub-Gaussian multiple- 
6 estimator for (Vg h min r 7 .when n > (3-|-ln4) 124A:„; 

2. on the other hand, for any L > \/2, there exist no,a,L G N and Ca,L > 0 such that 

no multiple-6 L-sub-Gaussian estimator can exist for n, when n > 

^o,a,L is large enough; 

3. finally, for L > y/2 there is no 6-dependent L sub-Gaussian estimator for {Va,r],n, 

3.4 Bounded kurtosis and nearly optimal constants 

This section shows that multiple-$\delta$ sub-Gaussian estimation with nearly optimal
constants is possible when the kurtosis

\[ \kappa_P = \frac{\mathbb{E}(X - \mu_P)^4}{\sigma_P^4} \]

(when $X =_d P$) is uniformly bounded in the class. (For completeness, we set $\kappa_P = 1$
when $\sigma_P = 0$.) More specifically, we will consider the class
$\mathcal{P}_{\mathrm{krt} \le \kappa}$ of all distributions $P \in \mathcal{P}_2$ with
$\kappa_P \le \kappa$.

To state the result, let $b_{\max}$ be a positive integer to be specified below. Also define

^ = 2V2k^ + 36./^^ + 1120V^^ . 

n \ n n 

Note that when 6max ^ ^ = o(l). The main result for classes of distributions 

with bounded kurtosis is the following. For the proof see Section 7.

Theorem 3.6 Let n> A, L = V2(l + 0, There exists an absolute con¬ 

stant C such that, f/^ftmax/^ < C, then there exists a multiple-6 L-sub-Gaussian estimator 
for iVi,n,6^^J. 

This result is most interesting in the regime where n ^ oo, k = Kn possibly depends 
on n and n/Kn —>■ oo. In this case, we may take 6max ^ and obtain multiple- 

6 (\/2 + o (l))-sub-Gaussian estimators (Pkrt<K, ~ Catoni [5] ob¬ 

tained (5-dependent y/2 -\- o (l)-estimators for a smaller value (5^;^^^ ~ e~'^l'^. In Remark [2] 
we show how one can obtain a similar range of 6 with a multiple-(5 estimator, albeit with 
worse constant L. 

4 General methods 

We collect here some ideas that recur in the remainder of the paper. 

1. Section 4.1 presents an analysis of the median-of-means estimator mentioned in
Section 2.3.2 above. We present a proof based on Hsu’s argument [6].

2. Section 4.2 presents a “black-box method” of deriving multiple-$\delta$ estimators from
confidence intervals. The point is that confidence intervals are “$\delta$-dependent objects”,
and thus easier to design and analyze.

3. In Section 4.3 we use scaled Bernoulli distributions to prove the impossibility of
designing (weakly) sub-Gaussian estimators for classes with distributions with unbounded
variance.

4. Section 4.4 uses the family of Laplace distributions to lower bound $\delta_{\min}$ for
$\delta$-dependent estimators.

5. Section 4.5 uses the Poisson family to derive lower bounds on $\delta_{\min}$ for
multiple-$\delta$ estimators.

A combination of the above results will allow us to derive the sharp range for
$\ln(1/\delta_{\min})$ for all families of distributions we consider.

4.1 Median of means

The next result is a well known performance bound for the median-of-means estimator. We
include the proof for completeness.


Theorem 4.1 For any $n \ge 4$ and $L = 2\sqrt{2}\,e$ there exists a $\delta$-dependent
$L$-sub-Gaussian estimator for $(\mathcal{P}_2, n, e^{1 - n/2})$.

Proof: We follow the argument of Hsu [6]. Given a positive integer $b$ and a vector
$x_1^b \in \mathbb{R}^b$, we let $q_{1/2}$ denote the median of the numbers
$x_1, x_2, \ldots, x_b$, that is,

\[ q_{1/2}(x_1^b) = x_i, \quad \text{where } \#\{k \in [b] : x_k \le x_i\} \ge \frac{b}{2} \text{ and } \#\{k \in [b] : x_k \ge x_i\} \ge \frac{b}{2}. \]

(If several $i$ fit the above description, we take the smallest one.) We need the following
Lemma (proven subsequently):


Lemma 4.1 Let $Y_1^b = (Y_1, \ldots, Y_b) \in \mathbb{R}^b$ be independent random variables
with the same mean $\mu$ and variances bounded by $\sigma^2$. Assume $L_0 > 1$ is given and
$M_b = q_{1/2}(Y_1^b)$. Then
$\mathbb{P}\left( |M_b - \mu| > 2 L_0 \sigma \right) \le L_0^{-b}$.


In our case we set $L_0 = e = L/(2\sqrt{2})$. To build our estimator for a given
$\delta \in [e^{1 - n/2}, 1)$, we first choose

\[ b = \lceil \ln(1/\delta) \rceil \]

and note that $b \le n/2$.

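Now divide $[n]$ into $b$ blocks (i.e., disjoint subsets) $B_i$, $1 \le i \le b$, each of size
$|B_i| \ge k = \lfloor n/b \rfloor \ge 2$. Given $x_1^n \in \mathbb{R}^n$, we define

\[ y_{n,\delta}(x_1^n) = \left( y_{n,\delta,i}(x_1^n) \right)_{i=1}^b \in \mathbb{R}^b \quad \text{with coordinates} \quad y_{n,\delta,i}(x_1^n) = \frac{1}{|B_i|} \sum_{j \in B_i} x_j, \]

and define the median-of-means estimator by
$\widehat{E}_{n,\delta}(x_1^n) = q_{1/2}(y_{n,\delta}(x_1^n))$.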

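We now show that $\widehat{E}_{n,\delta}$ is a sub-Gaussian estimator for the class
$\mathcal{P}_2$. Let $X_1^n =_d P^{\otimes n}$ for a distribution $P \in \mathcal{P}_2$. Then
$\widehat{E}_{n,\delta}(X_1^n)$ is the median of the random variables

\[ Y_i = \frac{1}{|B_i|} \sum_{j \in B_i} X_j, \quad i \in [b]. \]

Each $Y_i$ has mean $\mu_P$ and variance $\sigma_P^2/|B_i| \le \sigma_P^2/k$. Then, using our
choice of $b$, Lemma 4.1 implies

\[ \mathbb{P}\left( \left|\widehat{E}_{n,\delta}(X_1^n) - \mu_P\right| > \frac{2 L_0 \sigma_P}{\sqrt{k}} \right) \le L_0^{-b} \le \delta. \]

Now, because $b = \lceil \ln(1/\delta) \rceil \le n/2$,

\[ k = \left\lfloor \frac{n}{b} \right\rfloor \ge \frac{n}{2b} \ge \frac{n}{2(1 + \ln(1/\delta))}, \]

and

\[ \frac{2 L_0}{\sqrt{k}} \le \frac{2 L_0 \sqrt{2}\, \sqrt{1 + \ln(1/\delta)}}{\sqrt{n}} = L \sqrt{\frac{1 + \ln(1/\delta)}{n}}. \]

Therefore,

\[ \mathbb{P}\left( \left|\widehat{E}_{n,\delta}(X_1^n) - \mu_P\right| > L \sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \le \delta, \]

and since this works for any $P \in \mathcal{P}_2$, the proof is complete. □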
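The construction in the proof can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the function names are ours, the blocks are taken to be contiguous index ranges (the proof only requires disjoint blocks of size at least $\lfloor n/b \rfloor$), and `max(1, ...)` guards the edge case $\delta = 1$, which the theorem excludes.

```python
from math import ceil, log
from typing import List, Sequence


def median_q_half(values: Sequence[float]) -> float:
    """Median q_{1/2}: the first x_i with at least b/2 values <= x_i and at least b/2 values >= x_i."""
    b = len(values)
    for x in values:
        at_most = sum(1 for v in values if v <= x)
        at_least = sum(1 for v in values if v >= x)
        if 2 * at_most >= b and 2 * at_least >= b:
            return x
    raise ValueError("values must be non-empty")


def median_of_means(sample: Sequence[float], delta: float) -> float:
    """delta-dependent median-of-means estimate with b = ceil(ln(1/delta)) blocks."""
    n = len(sample)
    b = max(1, ceil(log(1.0 / delta)))  # assumption: at least one block
    if b > n / 2:
        raise ValueError("delta is too small for this sample size (need b <= n/2)")
    # Contiguous blocks whose sizes differ by at most one, so each has >= floor(n/b) points.
    base, extra = divmod(n, b)
    block_means: List[float] = []
    start = 0
    for i in range(b):
        size = base + (1 if i < extra else 0)
        block = sample[start:start + size]
        block_means.append(sum(block) / size)
        start += size
    return median_q_half(block_means)
```

With one gross outlier in ten observations and $\delta = 0.05$ (three blocks), the estimate ignores the contaminated block, whereas $\delta = 0.5$ (one block) reduces to the empirical mean. This also shows concretely why the estimator depends on $\delta$.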
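Proof of Lemma 4.1: Let $I = [\mu - 2 L_0 \sigma, \mu + 2 L_0 \sigma]$. Clearly,

\[ \mathbf{1}\{M_b \notin I\} \le \mathbf{1}\left\{ \sum_{j=1}^b \mathbf{1}\{Y_j \notin I\} \ge \frac{b}{2} \right\}. \]

The indicator variables on the right-hand side are all independent, and by Chebyshev’s
inequality, for all $j \in [b]$,

\[ \mathbb{P}(Y_j \notin I) \le \frac{\mathbb{E}[(Y_j - \mu)^2]}{(2 L_0 \sigma)^2} \le \frac{1}{4 L_0^2}. \]

We deduce that $\sum_{j=1}^b \mathbf{1}\{Y_j \notin I\}$ is stochastically dominated by a
binomial random variable $\mathrm{Bin}(b, (2L_0)^{-2})$ and therefore,

\[ \mathbb{P}(M_b \notin I) \le \mathbb{P}\left( \mathrm{Bin}(b, (2L_0)^{-2}) \ge \frac{b}{2} \right) = \sum_{k = \lceil b/2 \rceil}^{b} \binom{b}{k} \left( \frac{1}{(2L_0)^2} \right)^{k} \left( 1 - \frac{1}{(2L_0)^2} \right)^{b-k} \le \left( \frac{1}{2L_0} \right)^{2 \lceil b/2 \rceil} \sum_{k = \lceil b/2 \rceil}^{b} \binom{b}{k} \le L_0^{-b}, \]

since

\[ \sum_{k = \lceil b/2 \rceil}^{b} \binom{b}{k} \le \sum_{k=0}^{b} \binom{b}{k} = 2^b. \]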
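The last chain of inequalities can be sanity-checked numerically by computing the binomial tail exactly. The short sketch below is ours (the names `binomial_upper_tail` and `lemma_bound` are illustrative); it evaluates $\mathbb{P}(\mathrm{Bin}(b, q) \ge b/2)$ with $q = (2L_0)^{-2}$ and compares it with $L_0^{-b}$.

```python
from math import ceil, comb


def binomial_upper_tail(b: int, q: float) -> float:
    """Exact P(Bin(b, q) >= b/2), summing the mass from k = ceil(b/2) up to b."""
    return sum(
        comb(b, k) * q**k * (1.0 - q) ** (b - k)
        for k in range(ceil(b / 2), b + 1)
    )


def lemma_bound(b: int, l0: float) -> float:
    """Right-hand side L0^{-b} of the median concentration bound."""
    return l0 ** (-b)
```

For instance, with $L_0 = e$ and $b = 10$ one can compare `binomial_upper_tail(10, 1 / (2 * 2.718281828459045) ** 2)` against `lemma_bound(10, 2.718281828459045)`. The bound is far from tight, which is consistent with the remark that median of means loses constant factors.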


4.2 The method of confidence intervals for multiple-$\delta$ estimators

In this section we detail how sub-Gaussian confidence intervals may be combined to produce
multiple-$\delta$ estimators. This will be our main tool in defining all multiple-$\delta$ estimators whose
existence is claimed in Theorems IQ and E31 First we need a definition. 
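Definition 3 Let $n$ be a positive integer, $\delta \in (0,1)$ and let $\mathcal{P}$ be a class
of probability distributions over $\mathbb{R}$. A measurable closed interval
$\widehat{I}_{n,\delta}(\cdot) = [\widehat{a}_{n,\delta}(\cdot), \widehat{b}_{n,\delta}(\cdot)]$
consists of a pair of measurable functions
$\widehat{a}_{n,\delta}, \widehat{b}_{n,\delta} : \mathbb{R}^n \to \mathbb{R}$ with
$\widehat{a}_{n,\delta} \le \widehat{b}_{n,\delta}$. We let
$\widehat{\ell}_{n,\delta} = \widehat{b}_{n,\delta} - \widehat{a}_{n,\delta}$ denote the length
of the interval. We say $\{\widehat{I}_{n,\delta}\}_{\delta \in [\delta_{\min}, 1)}$ is a
collection of $L$-sub-Gaussian confidence intervals for $(n, \mathcal{P}, \delta_{\min})$ if for
any $P \in \mathcal{P}$, if $X_1^n =_d P^{\otimes n}$, then for all
$\delta \in [\delta_{\min}, 1)$,

\[ \mathbb{P}\left( \mu_P \in \widehat{I}_{n,\delta}(X_1^n) \text{ and } \widehat{\ell}_{n,\delta}(X_1^n) \le L \sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \ge 1 - \delta. \]

The next theorem shows how one can combine sub-Gaussian confidence intervals to
obtain a multiple-$\delta$ sub-Gaussian mean estimator.

Theorem 4.2 Let $n$ be a positive integer and let $\mathcal{P}$ be a class of probability
distributions over $\mathbb{R}$. Assume that there exists a collection of $L$-sub-Gaussian
confidence intervals for $(n, \mathcal{P}, \delta_{\min})$. Then there exists a
multiple-$\delta$ estimator $\widehat{E}_n : \mathbb{R}^n \to \mathbb{R}$ that is
$L'$-sub-Gaussian for $(n, \mathcal{P}, 2^{-m})$, where $L' = L \sqrt{1 + 2 \ln 2}$ and
$m = \lfloor \log_2(1/\delta_{\min}) \rfloor - 1 \ge \log_2(1/\delta_{\min}) - 2$
(in particular, $2^{-m} \le 4 \delta_{\min}$).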

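Proof: Our choice of $m$ implies that, for each $k = 1, 2, 3, \ldots, m+1$ there exists a
measurable closed interval $\widehat{I}_k(\cdot) = [\widehat{a}_k(\cdot), \widehat{b}_k(\cdot)]$
with length $\widehat{\ell}_k(\cdot)$, with the property that, if $P \in \mathcal{P}$ and
$X_1^n =_d P^{\otimes n}$, the event

\[ G_k := \left\{ \mu_P \in \widehat{I}_k(X_1^n) \text{ and } \widehat{\ell}_k(X_1^n) \le L \sigma_P \sqrt{\frac{1 + \ln(2^k)}{n}} \right\} \tag{9} \]

has probability $\mathbb{P}(G_k) \ge 1 - 2^{-k}$. To define our estimator, define, for
$x_1^n \in \mathbb{R}^n$,

\[ \widehat{k}_n(x_1^n) = \min\left\{ k \in [m] : \bigcap_{j=k}^{m} \widehat{I}_j(x_1^n) \ne \emptyset \right\}. \]

One can easily check that

\[ \bigcap_{j = \widehat{k}_n(x_1^n)}^{m} \widehat{I}_j(x_1^n) \text{ is always a non-empty closed interval,} \]

so it makes sense to define the estimator $\widehat{E}_n(x_1^n)$ as its midpoint.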

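We claim that $\widehat{E}_n$ is the sub-Gaussian estimator we are looking for. To prove this,
we let $2^{-m} \le \delta < 1$ and choose the smallest $k \in \{1, 2, \ldots, m+1\}$ with
$2^{1-k} \le \delta$. Assume $X_1^n =_d P^{\otimes n}$ with $P \in \mathcal{P}$. Then

1. $\mathbb{P}\left( \bigcap_{j=k}^{m+1} G_j \right) \ge 1 - \sum_{j=k}^{m+1} 2^{-j} \ge 1 - 2^{1-k} \ge 1 - \delta$
by (9) and the choice of $k$.

2. When $\bigcap_{j=k}^{m+1} G_j$ holds, $\mu_P \in \widehat{I}_j(X_1^n)$ for all
$k \le j \le m+1$, so $\mu_P \in \bigcap_{j=k}^{m+1} \widehat{I}_j(X_1^n)$. In particular,
$\bigcap_{j=k}^{m+1} \widehat{I}_j(X_1^n) \ne \emptyset$ and $\widehat{k}_n(X_1^n) \le k$.

3. Now when $\widehat{k}_n(X_1^n) \le k$,
$\widehat{E}_n(X_1^n) \in \bigcap_{j=k}^{m+1} \widehat{I}_j(X_1^n)$ as well, so both
$\widehat{E}_n(X_1^n)$ and $\mu_P$ belong to $\widehat{I}_k(X_1^n)$. It follows that
$|\widehat{E}_n(X_1^n) - \mu_P| \le \widehat{\ell}_k(X_1^n)$.

4. Finally, our choice of $k$ implies $2^{1-k} \le \delta \le 2^{2-k}$, so, under $G_k$, we have

\[ \widehat{\ell}_k(X_1^n) \le L \sigma_P \sqrt{\frac{1 + \ln(2^k)}{n}} \le L \sigma_P \sqrt{\frac{1 + 2 \ln 2 + \ln(1/\delta)}{n}} \le L' \sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}}, \]

with $L' = L \sqrt{1 + 2 \ln 2}$ as in the statement of the theorem.

Putting it all together, we conclude

\[ \mathbb{P}\left( \left|\widehat{E}_n(X_1^n) - \mu_P\right| \le L' \sigma_P \sqrt{\frac{1 + \ln(1/\delta)}{n}} \right) \ge 1 - \delta, \]

and since this holds for all $P \in \mathcal{P}$ and all $2^{-m} \le \delta < 1$, the proof is
complete. □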
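The midpoint rule used in this proof is easy to express in code. The following Python sketch is our illustration of that rule only: it takes a list of already computed closed intervals (how they are obtained is outside its scope), treats list position $j-1$ as the interval with index $j$, and returns the midpoint of the intersection of the longest feasible tail of the list. The names `intersect` and `combine_intervals` are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def intersect(intervals: List[Interval]) -> Optional[Interval]:
    """Intersection of closed intervals, or None if it is empty."""
    lo = max(a for a, _ in intervals)
    hi = min(b for _, b in intervals)
    return (lo, hi) if lo <= hi else None


def combine_intervals(intervals: List[Interval]) -> float:
    """Midpoint of the intersection of I_k, ..., I_m for the smallest feasible k.

    intervals[j - 1] plays the role of the interval with index j.
    """
    m = len(intervals)
    for k in range(m):  # k = 0 here corresponds to index k = 1 in the text
        common = intersect(intervals[k:])
        if common is not None:
            return 0.5 * (common[0] + common[1])
    raise ValueError("at least one interval is required")
```

Because a single closed interval is always non-empty, the loop returns at the latest when only the last interval remains, which mirrors the remark that the intersection starting at the selected index is never empty.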


4.3 Scaled Bernoulli distributions and single-$\delta$ estimators

In this subsection we prove Theorem 3.1. In order to do so, we derive a simple minimax
lower bound for single-$\delta$ estimators for the class $\mathcal{P}_{c,p} = \{P_+, P_-\}$ of
distributions that contains two discrete distributions defined by

\[ P_+(\{0\}) = P_-(\{0\}) = 1 - p, \quad P_+(\{c\}) = P_-(\{-c\}) = p, \]

where $p \in [0,1]$ and $c > 0$. Note that $\mu_{P_+} = pc$, $\mu_{P_-} = -pc$ and that for any
$\alpha > 0$, the $(1+\alpha)$-th central moment of both distributions equals

\[ M = c^{1+\alpha} p (1-p) \left( p^{\alpha} + (1-p)^{\alpha} \right). \tag{10} \]
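The closed form (10) follows by evaluating $\mathbb{E}|X - pc|^{1+\alpha}$ over the two atoms of the distribution. A small Python check, written by us for illustration with hypothetical function names, compares the direct computation with the closed form.

```python
def central_moment_direct(c: float, p: float, alpha: float) -> float:
    """E|X - EX|^(1+alpha) for X = c with probability p and X = 0 otherwise."""
    mean = p * c
    return p * abs(c - mean) ** (1 + alpha) + (1 - p) * abs(0.0 - mean) ** (1 + alpha)


def central_moment_formula(c: float, p: float, alpha: float) -> float:
    """Closed form c^(1+alpha) * p * (1-p) * (p^alpha + (1-p)^alpha) from (10)."""
    return c ** (1 + alpha) * p * (1 - p) * (p**alpha + (1 - p) ** alpha)
```

The two functions agree up to floating-point rounding for any $c > 0$, $p \in [0,1]$ and $\alpha > 0$. By symmetry the same value is obtained for the distribution with an atom at $-c$.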

For $i = 1, \ldots, n$, let $(X_i, Y_i)$ be independent pairs of real-valued random variables
such that

\[ \mathbb{P}\{X_i = Y_i = 0\} = 1 - p \quad \text{and} \quad \mathbb{P}\{X_i = c, Y_i = -c\} = p. \]

Note that Xi ~ P+ and Y) ~ P_. Let 5 G (0,1/2). If 5 > 2e~'^P and p = (2/n) log(2/5), 
then (using 1 — p > exp(—/5/(l —p))), 

= X} = (1 - pY > 25 . 
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Let En ^5 be any mean estimator, possibly depending on <5. Then 


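$$
\begin{aligned}
&\max\Big\{ P\big(|E_{n,\delta}(X_1^n) - \mu_{P_+}| > cp\big),\ P\big(|E_{n,\delta}(Y_1^n) - \mu_{P_-}| > cp\big) \Big\} \\
&\qquad \ge \tfrac12\, P\big(|E_{n,\delta}(X_1^n) - \mu_{P_+}| > cp \ \text{ or }\ |E_{n,\delta}(Y_1^n) - \mu_{P_-}| > cp\big) \\
&\qquad \ge \tfrac12\, P\{E_{n,\delta}(X_1^n) = E_{n,\delta}(Y_1^n)\} \\
&\qquad \ge \tfrac12\, P\{X_1^n = Y_1^n\} \ge \delta.
\end{aligned}
$$

From (10) we have that $cp \ge M^{1/(1+\alpha)} (p/2)^{\alpha/(1+\alpha)}$ and therefore

$$\max\Big\{ P\Big(|E_{n,\delta}(X_1^n) - \mu_{P_+}| > \Big(\frac{M^{1/\alpha}\log(2/\delta)}{n}\Big)^{\alpha/(1+\alpha)}\Big),\ P\Big(|E_{n,\delta}(Y_1^n) - \mu_{P_-}| > \Big(\frac{M^{1/\alpha}\log(2/\delta)}{n}\Big)^{\alpha/(1+\alpha)}\Big) \Big\} \ge \delta.$$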


Theorem 3.1 simply follows by noting that $\mathcal{P}_{c,p}$ is contained in the class of distributions considered in that theorem.


4.4 Laplace distributions and single-δ estimators

This section focuses on the class of all Laplace distributions with scale parameter equal to 1. To define such a distribution, let $\lambda \in \mathbb{R}$ and let $La_\lambda$ be the probability measure on $\mathbb{R}$ with density

$$\frac{dLa_\lambda}{dx}(x) = \frac{e^{-|x-\lambda|}}{2}.$$

Denote by $\mathcal{P}_{La} = \{La_\lambda : \lambda \in \mathbb{R}\}$ the class of all such distributions.

A simple calculation reveals that for all $\lambda \in \mathbb{R}$, the mean, variance, and central third moment are $\mu_{La_\lambda} = \lambda$, $\sigma^2_{La_\lambda} = 2$ and $La_\lambda|X - \lambda|^3 = 6 \le (\eta\,\sigma_{La_\lambda})^3$ with $\eta = 3^{1/3}\,2^{-1/6}$.

The next result proves that δ-dependent $L$-sub-Gaussian estimators are limited to exponentially small δ even over the one-dimensional family $\mathcal{P}_{La}$.

Theorem 4.3 If $n \ge 3$ then, for any constant $L \ge \sqrt{2}$, there are no δ-dependent $L$-sub-Gaussian estimators for $(\mathcal{P}_{La}, n, e^{1-5L^2 n})$.


Proof: We proceed by contradiction, assuming that there exist $L$-sub-Gaussian δ-dependent estimators $E_{n,\delta}$ for $(\mathcal{P}_{La}, n, \delta)$ where $\delta = e^{1-5L^2 n}$ and arbitrarily large $n$. We set

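$$\lambda = 2L\sqrt{2\,(1+\ln(1/\delta))/n}$$

and consider $X_1^n =_d La_0^{\otimes n}$ and $Y_1^n =_d La_\lambda^{\otimes n}$. The triangle inequality applied to the exponents of $dLa_0/dx$ and $dLa_\lambda/dx$ shows that the densities of the two product measures satisfy, for all $x_1^n \in \mathbb{R}^n$,

$$\frac{dLa_0^{\otimes n}}{dx_1^n}(x_1^n) \ \ge\ e^{-\lambda n}\, \frac{dLa_\lambda^{\otimes n}}{dx_1^n}(x_1^n),$$

and therefore,

$$P\big(E_{n,\delta}(X_1^n) \ge \lambda/2\big) \ \ge\ e^{-\lambda n}\, P\big(E_{n,\delta}(Y_1^n) \ge \lambda/2\big). \qquad (11)$$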
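The one-dimensional version of the density comparison above is easy to verify numerically. The Python sketch below is only an illustration under the assumption that the density is $e^{-|x-\lambda|}/2$; it evaluates the ratio of the $La_0$ and $La_\lambda$ densities on a grid and checks that it never drops below $e^{-\lambda}$, which is what the triangle inequality $|x| \le |x - \lambda| + \lambda$ gives. Taking the product over $n$ coordinates yields the factor $e^{-\lambda n}$.

```python
import math


def laplace_density(x: float, loc: float) -> float:
    """Density of the Laplace distribution with location `loc` and scale 1."""
    return 0.5 * math.exp(-abs(x - loc))


def min_density_ratio(lam: float, grid) -> float:
    """Smallest value over the grid of (density at location 0) / (density at location lam)."""
    return min(laplace_density(x, 0.0) / laplace_density(x, lam) for x in grid)


lam = 0.7
grid = [i / 100 for i in range(-1000, 1001)]
ratio = min_density_ratio(lam, grid)
lower_bound = math.exp(-lam)
```

The minimum is attained for every $x \ge \lambda$, where the ratio equals $e^{-\lambda}$ exactly, so the bound cannot be improved.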

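Using the definition of $\lambda$ and the fact that $\mu_{La_\lambda} = \lambda$ and $\sigma^2_{La_\lambda} = 2$, we see that the right-hand side above is simply

$$e^{-\lambda n}\, P\Big(E_{n,\delta}(Y_1^n) \ge \mu_{La_\lambda} - L\,\sigma_{La_\lambda}\sqrt{\frac{1+\ln(1/\delta)}{n}}\Big) \ \ge\ e^{-\lambda n}(1-\delta).$$

On the other hand, the left-hand side in (11) is

$$P\Big(E_{n,\delta}(X_1^n) \ge \mu_{La_0} + L\,\sigma_{La_0}\sqrt{\frac{1+\ln(1/\delta)}{n}}\Big) \ \le\ \delta.$$

We deduce

$$e^{-\lambda n} \le \frac{\delta}{1-\delta} \le 2\delta.$$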

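If we use again the definition of $\lambda$, we see that

$$e^{-2L\sqrt{n\,(1+\ln(1/\delta))}} \le 2\delta$$

or

$$e^{-2\sqrt{5}\,L^2 n} \le 2\,e^{1-5L^2 n}, \qquad\text{that is,}\qquad n \le \frac{1+\ln 2}{L^2\,(5-2\sqrt{5})}.$$

For $L \ge \sqrt{2}$, some simple estimates show that this leads to a contradiction when $n \ge 3$.
□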
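The "simple estimates" in the last step amount to a one-line computation. As a hedged illustration, the Python sketch below evaluates the right-hand side $(1+\ln 2)/(L^2(5-2\sqrt5))$ of the final inequality as read here; the function name is hypothetical.

```python
import math


def max_admissible_n(L: float) -> float:
    """Upper bound on n implied by exp(-2*sqrt(5)*L^2*n) <= 2*exp(1 - 5*L^2*n)."""
    return (1 + math.log(2)) / (L ** 2 * (5 - 2 * math.sqrt(5)))


bound_at_sqrt2 = max_admissible_n(math.sqrt(2))
```

Since the bound is decreasing in $L$, its value at $L = \sqrt2$ (about 1.6) is the largest it can be, and no $n \ge 3$ satisfies the inequality.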


4.5 Poisson distributions and multiple-δ estimators

We use the family of Poisson distributions for bounding the range of confidence values of multiple-δ estimators. Denote by $Po_\lambda$ the Poisson distribution with parameter $\lambda > 0$. Given $0 < \lambda_1 < \lambda_2 < \infty$, define

$$\mathcal{P}_{Po}^{[\lambda_1,\lambda_2]} = \{Po_\lambda : \lambda \in [\lambda_1, \lambda_2]\}.$$

Theorem 4.4 There exist positive constants $c_0, s_0$ and a function $\phi : \mathbb{R}_+ \to \mathbb{R}_+$ such that the following holds. Assume $L \ge \sqrt{2}$ and $n > 0$ are given. Then there exists no
multiple-5 L-sub-Gaussian estimator for 7 ^^ gi-so 


Proof: We prove the following stronger result: there exist constants $c_0, s > 0$ such that, when $c \ge c_0$, $L \ge \sqrt{2}$ and $C = \lceil s\,(L^2 \ln L) \rceil$, there is no multiple-δ sub-Gaussian estimator
for



(l+2C)c j 


1 - 


,n,e 



The theorem then follows by taking 2C = (i){L) — 1 = s Lf'lnL and sq = 
We proceed by contradiction. Assume

$$X_1^n =_d Po_{c/n}^{\otimes n}, \qquad Y_1^n =_d Po_{(1+2C)c/n}^{\otimes n},$$

and that there exists an $L$-sub-Gaussian estimator $E_n : \mathbb{R}^n \to \mathbb{R}$ for (*) above. We use the following well-known facts about Poisson distributions.

F0 $\mu_{Po_{c/n}} = \sigma^2_{Po_{c/n}} = c/n$ and $\mu_{Po_{(1+2C)c/n}} = \sigma^2_{Po_{(1+2C)c/n}} = (1+2C)\,c/n$.

F1 $S_X = X_1 + X_2 + \dots + X_n =_d Po_c$ and $S_Y = Y_1 + Y_2 + \dots + Y_n =_d Po_{(1+2C)c}$.

F2 Given any $k \in \mathbb{N}$, the distribution of $X_1^n$ conditioned on $S_X = k$ is the same as the distribution of $Y_1^n$ conditioned on $S_Y = k$.

F3 $P(S_Y = (1+2C)\,c) \ge 1/\big(4\sqrt{(1+2C)\,c}\big)$ if $C > 0$ and $c \ge c_0$ for some $c_0$. (This follows from the fact that $Po_m(\{m\}) = e^{-m} m^m/m!$ is asymptotic to $1/\sqrt{2\pi m}$ when $m \to \infty$, by Stirling's formula.)

F4 There exists a function $h$ with $0 < h(C) \sim (1+C)\ln(1+C)$ such that, for all $c \ge c_0$, $P(S_X = (1+2C)\,c) \ge e^{-h(C)\,c}$. This follows from another asymptotic estimate proven by Stirling's formula: as $c \to \infty$,

$$Po_c(\{(1+2C)c\}) = \frac{e^{-c}\,c^{(1+2C)c}}{((1+2C)c)!} \sim \frac{e^{-[(1+2C)\ln(1+2C) - 2C]\,c}}{\sqrt{2\pi(1+2C)\,c}}.$$
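Both Stirling estimates in F3 and F4 can be sanity-checked numerically. The Python sketch below is an illustration only: it evaluates the Poisson probability mass function through `math.lgamma` to avoid overflow and compares it with the two asymptotic expressions. The specific values of `m`, `c` and `C` are arbitrary choices made for the example.

```python
import math


def poisson_pmf(lam: float, k: int) -> float:
    """P(Po_lam = k), computed in log-space for numerical stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def f3_ratio(m: int) -> float:
    """Po_m({m}) divided by its Stirling approximation 1/sqrt(2*pi*m)."""
    return poisson_pmf(m, m) * math.sqrt(2 * math.pi * m)


def f4_ratio(c: int, C: int) -> float:
    """Po_c({(1+2C)c}) divided by the asymptotic expression in F4."""
    k = (1 + 2 * C) * c
    exponent = ((1 + 2 * C) * math.log(1 + 2 * C) - 2 * C) * c
    approx = math.exp(-exponent) / math.sqrt(2 * math.pi * k)
    return poisson_pmf(c, k) / approx
```

Both ratios approach 1 from below as the arguments grow, and the lower bound $1/(4\sqrt{m})$ of F3 is comfortably satisfied because $1/\sqrt{2\pi} \approx 0.399 > 1/4$.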


We apply the sub-Gaussian property for the triple (*) to $\delta = 1/\big(8\sqrt{(1+2C)\,c}\big)$. This is possible because, for $C = \lceil s\,(L^2 \ln L) \rceil$ with a large enough $s$, this value is $\approx 1/(L\sqrt{s \ln L\, c})$, which is much larger than the minimum confidence parameter allowed by (*) (at least if $c \ge c_0$ with a large enough $c_0$). Recalling F0, we obtain

$$P\Big(nE_n(Y_1^n) < (1+2C)\,c - L\sqrt{(1+2C)\,c\,\big(1+\ln(8\sqrt{(1+2C)\,c})\big)}\Big) \le \frac{1}{8\sqrt{(1+2C)\,c}}.$$

Therefore, by F3,

$$P\Big(nE_n(Y_1^n) < (1+2C)\,c - L\sqrt{(1+2C)\,c\,\big(1+\ln(8\sqrt{(1+2C)\,c})\big)} \ \Big|\ S_Y = (1+2C)\,c\Big) \le 1/2.$$

Now F2 implies that the left-hand side is the same if we switch from $Y$ to $X$. In particular, by looking at the complementary event we obtain

$$P\Big(nE_n(X_1^n) \ge (1+2C)\,c - L\sqrt{(1+2C)\,c\,\big(1+\ln(8\sqrt{(1+2C)\,c})\big)} \ \Big|\ S_X = (1+2C)\,c\Big) \ge 1/2. \qquad (12)$$

Since we are taking c > cq and C > s In L, a calculation reveals 


(1 + 2C) c(l + ln(8V(l + 2C)^ = O 


= o(c . 


Therefore, by taking a large enough $c_0$ we can ensure that

$$L\sqrt{(1+2C)\,c\,\big(1+\ln(8\sqrt{(1+2C)\,c})\big)} \le C\,c.$$

So (12) gives

$$P\big(nE_n(X_1^n) \ge (1+C)\,c \ \big|\ S_X = (1+2C)\,c\big) \ge 1/2.$$
We may combine this with F4 to deduce:

$$P\big(nE_n(X_1^n) \ge (1+C)\,c\big) \ge \frac{e^{-h(C)\,c}}{2}. \qquad (13)$$

We now use F0 to rewrite the previous probability as

$$P\big(nE_n(X_1^n) \ge (1+C)\,c\big) = P\Big(E_n(X_1^n) - \mu_P \ge L\,\sigma_P\sqrt{\frac{1+\ln(1/\delta_0)}{n}}\Big), \qquad\text{where}\quad \delta_0 = e^{1 - C^2 c/L^2}.$$

Since we assumed $E_n$ is $L$-sub-Gaussian for the triple (*), we obtain

$$\frac{e^{-h(C)\,c}}{2} \le P\big(nE_n(X_1^n) \ge (1+C)\,c\big) \le e^{1 - C^2 c/L^2}.$$

Comparing the left and right hand sides, and recalling $c \ge c_0$, we obtain $h(C) \ge C^2/L^2 - 1 - (\ln 2/c_0)$. This is a contradiction if $C \gg L^2 \ln L$ because $h(C)$ grows like $C \ln C$ (cf.
F4). This contradiction shows that there does not exist an $L$-sub-Gaussian estimator for
(*), as desired. □

5 Degrees of knowledge about the variance


In this section we present the proof of Theorem 3.2. This is mostly a matter of combining the main results in the previous section. Recall that we consider the class

$$\mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]} = \{P \in \mathcal{P}_2 : \sigma_1^2 \le \sigma_P^2 \le \sigma_2^2\}$$

and that $R = \sigma_2/\sigma_1$. The three parts of the theorem are proven separately.


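Part 1: (Existence of a multiple-δ estimator with constant depending on $R$.) Theorem 4.1 ensures that, irrespective of $\sigma_1$ or $\sigma_2$, for all $\delta \in [e^{1-n/2}, 1)$ there exists a δ-dependent estimator $E_{n,\delta} : \mathbb{R}^n \to \mathbb{R}$ with

$$P\Big(|E_{n,\delta}(X_1^n) - \mu_P| > 2\sqrt{2}\,e\,\sigma_P\sqrt{\frac{1+\ln(1/\delta)}{n}}\Big) \le \delta \qquad (14)$$

whenever $X_1^n =_d P^{\otimes n}$ for some $P \in \mathcal{P}_2$. We define a confidence interval for each $\delta$ via

$$I_{n,\delta}(x_1^n) = \Big[E_{n,\delta}(x_1^n) - 2\sqrt{2}\,e\,\sigma_2\sqrt{\tfrac{1+\ln(1/\delta)}{n}},\ \ E_{n,\delta}(x_1^n) + 2\sqrt{2}\,e\,\sigma_2\sqrt{\tfrac{1+\ln(1/\delta)}{n}}\Big].$$

Clearly, (14) and the fact that $\sigma_2 \le R\,\sigma_P$ for all $P \in \mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]}$ imply that $\{I_{n,\delta}\}_{\delta \in [e^{1-n/2},1)}$ is a $4\sqrt{2}\,eR$-sub-Gaussian confidence interval for $(\mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]}, n, e^{1-n/2})$. Applying Theorem 4.2 gives the desired result.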
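The interval in Part 1 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the δ-dependent estimator of Theorem 4.1 is taken to be a standard median-of-means estimator with about $\ln(1/\delta)$ contiguous blocks (the block count and the splitting rule are assumptions made for illustration, not details confirmed by the text), and the function names are hypothetical. Only the half-width $2\sqrt2\,e\,\sigma_2\sqrt{(1+\ln(1/\delta))/n}$ is taken from the display above.

```python
import math
import statistics
from typing import List, Tuple


def median_of_means(xs: List[float], delta: float) -> float:
    """Median of block means, using roughly ln(1/delta) contiguous blocks."""
    n = len(xs)
    b = min(n, max(1, math.ceil(math.log(1 / delta))))
    # Near-equal contiguous blocks; each is non-empty because b <= n.
    blocks = [xs[i * n // b:(i + 1) * n // b] for i in range(b)]
    return statistics.median(sum(block) / len(block) for block in blocks)


def variance_bounded_interval(xs: List[float], delta: float,
                              sigma2: float) -> Tuple[float, float]:
    """Interval centred at the median-of-means estimate, with half-width
    2*sqrt(2)*e*sigma2*sqrt((1 + ln(1/delta))/n) built from the known
    upper bound sigma2 on the standard deviation."""
    center = median_of_means(xs, delta)
    half = 2 * math.sqrt(2) * math.e * sigma2 * math.sqrt(
        (1 + math.log(1 / delta)) / len(xs))
    return center - half, center + half
```

The interval length depends on the data only through $n$, which is why the resulting constant picks up the factor $R = \sigma_2/\sigma_1$: the width is calibrated to the largest admissible standard deviation rather than to $\sigma_P$.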


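Part 2: (Non-existence of multiple-δ estimators when $R > \phi^{(2)}(L)$.) We use Theorem 4.4. By rescaling, we may assume $\sigma_1^2 = c_0/n$, where $c_0$ is the constant appearing in Theorem 4.4. We also set $\phi^{(2)}(L) := \phi(L)$ for $\phi(L)$ as in Theorem 4.4. The assumption on $R$ ensures that $\mathcal{P}_{Po}^{[c_0/n,\ \phi(L)\,c_0/n]} \subset \mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]}$, so there cannot be an $L$-sub-Gaussian estimator when $\delta^{(2)}_{\min}(L)$ is the minimal confidence level from Theorem 4.4.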


Part 3: (Non-existence of δ-dependent estimators when $\delta_{\min} = e^{1-5L^2 n}$.) By rescaling, we may assume $\sigma_1^2 = 2$. Then the class $\mathcal{P}_{La}$ in Theorem 4.3 is contained in $\mathcal{P}_2^{[\sigma_1^2,\sigma_2^2]}$, and the theorem implies the desired result directly.


6 The regularity condition, symmetry and higher moments 

In this section we prove the results described in Section 3.3.

6.1 An estimator under k-regularity

We start with Theorem 3.3, the general positive result on k-regular classes.

Proof of Theorem 3.3: By Theorem 4.2, it suffices to build a $4\sqrt{2\,(1+62\ln 3)}\;e^{5/2}$-sub-Gaussian confidence interval for $(\mathcal{P}_{2,k\text{-reg}}, n, e^{3-n/(124k)})$.

To build these intervals, we use an idea related to the proof of Theorem 4.1. Just like in the case of the median-of-means estimator, we divide the data into blocks, but instead of taking the median of the means, we look at the 1/4 and 3/4-quantiles to build an interval.

To make this precise, given $\alpha \in (0,1)$, we define the $\alpha$-quantile $q_\alpha(y_1^b)$ of a vector $y_1^b \in \mathbb{R}^b$ as $y_i$, where $i \in [b]$ is the smallest index with

$$\#\{j \in [b] : y_j \le y_i\} \ge \alpha b \quad\text{and}\quad \#\{\ell \in [b] : y_\ell \ge y_i\} \ge (1-\alpha)\,b.$$

The next result (proven subsequently) is an analogue of Lemma 4.1.


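Lemma 6.1 Let $Y_1^b = (Y_1, \dots, Y_b) \in \mathbb{R}^b$ be a vector of independent random variables with the same mean $\mu$ and variances bounded by $\sigma^2$. Assume further that $P(Y_i \le \mu) \ge 1/3$ and $P(Y_i \ge \mu) \ge 1/3$ for each $i \in [b]$. Then

$$P\big(\mu \in [q_{1/4}(Y_1^b), q_{3/4}(Y_1^b)] \ \text{and}\ q_{3/4}(Y_1^b) - q_{1/4}(Y_1^b) \le 2L_0\sigma\big) \ge 1 - 3e^{-db},$$

where $d$ is the numerical constant

$$d = \tfrac14 \ln\tfrac34 + \tfrac34 \ln\tfrac98 \approx 0.0164 > \frac{1}{62}$$

and $L_0 = 2e^{(4d+1)/2} \le 2e^{5/2}$.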
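The two numerical constants of the lemma are quick to reproduce. The Python sketch below computes $d$ as the relative entropy between Bernoulli distributions with parameters 1/4 and 1/3 (as stated later in the proof) and, as an assumption made here, takes $L_0 = 2e^{(4d+1)/2}$, which is the value for which $4e/L_0^2 = e^{-4d}$. The helper name `bernoulli_kl` is hypothetical.

```python
import math


def bernoulli_kl(a: float, p: float) -> float:
    """Relative entropy KL(Bernoulli(a) || Bernoulli(p)) in nats."""
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))


d = bernoulli_kl(0.25, 1 / 3)        # approximately 0.0164
L0 = 2 * math.exp((4 * d + 1) / 2)   # assumed closed form for L_0
```

The inequality $d > 1/62$ is what later allows the choice $b = \lceil 62\ln(3/\delta)\rceil$ to guarantee $3e^{-db} \le \delta$.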


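Now fix $\delta \in [e^{3-n/(124k)}, 1)$. We define a confidence interval $I_{n,\delta}(\cdot)$ as follows. First set $b = \lceil 62\ln(3/\delta) \rceil$ and note that

$$b \le 62\ln\big(3/e^{3-n/(124k)}\big) + 1 \le \frac{n}{2k} \le n/2. \qquad (15)$$

Partition

$$[n] = B_1 \cup B_2 \cup \dots \cup B_b$$

into disjoint blocks of sizes $|B_i| \ge \lfloor n/b \rfloor$. For each $i \in [b]$ and $x_1^n \in \mathbb{R}^n$, we define

$$y_1^b(x_1^n) = (y_1(x_1^n), \dots, y_b(x_1^n)) \quad\text{where}\quad y_i(x_1^n) = \frac{1}{|B_i|}\sum_{j \in B_i} x_j,$$

and set, for $x_1^n \in \mathbb{R}^n$,

$$I_{n,\delta}(x_1^n) = \big[q_{1/4}(y_1^b(x_1^n)),\ q_{3/4}(y_1^b(x_1^n))\big].$$

Claim 1 $\{I_{n,\delta}(\cdot)\}_{\delta \in [e^{3-n/(124k)},1)}$ is a $4\sqrt{2\,(1+62\ln 3)}\;e^{5/2}$-sub-Gaussian collection of confidence intervals for $(\mathcal{P}_{2,k\text{-reg}}, n, e^{3-n/(124k)})$.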

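To see this, we take a distribution $P$ in this family and assume $X_1^n =_d P^{\otimes n}$. Set $s = \lfloor n/b \rfloor$. Because the blocks $B_i$ are disjoint and have at least $s$ elements each, the random variables

$$Y_i = y_i(X_1^n) = \frac{1}{|B_i|}\sum_{j \in B_i} X_j$$

all have mean $\mu_P$ and variance $\le \sigma_P^2/s$. Moreover, using (15),

$$s = \Big\lfloor \frac{n}{b} \Big\rfloor \ge \frac{n}{b} - 1 \ge 2k - 1 \ge k,$$

so the k-regularity property implies that for all $i \in [b]$,

$$P(Y_i \le \mu_P) \ge \tfrac13 \quad\text{and}\quad P(Y_i \ge \mu_P) \ge \tfrac13.$$

Lemma 6.1 implies

$$P\Big(\mu_P \in I_{n,\delta}(X_1^n) \ \text{and length of}\ I_{n,\delta}(X_1^n) \le 2L_0\frac{\sigma_P}{\sqrt{s}}\Big) \ge 1 - 3e^{-db} \ge 1 - \delta \qquad (16)$$

by the choice of $b$ and the fact that $d \ge 1/62$. To finish, we use (15) and the definition of $b$ to obtain

$$\frac{1}{s} = \frac{1}{\lfloor n/b \rfloor} \le \frac{1}{(n/b) - 1} \le \frac{2b}{n} \le \frac{2\,(1+62\ln(3/\delta))}{n} \le \frac{2\,(1+62\ln 3)\,(1+\ln(1/\delta))}{n}.$$

Plugging this back into (16) and recalling $L_0 \le 2e^{5/2}$ implies the desired result.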
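The quartile-based interval can be sketched directly from the definitions. The Python code below is a minimal illustration, assuming contiguous near-equal blocks (the text only requires disjoint blocks of size at least $\lfloor n/b \rfloor$) and reading the α-quantile as the value $y_i$ at the smallest qualifying index; all function names are hypothetical.

```python
import math
from typing import List, Tuple


def quantile(y: List[float], alpha: float) -> float:
    """Value y_i at the smallest index i with
    #{j: y_j <= y_i} >= alpha*b and #{l: y_l >= y_i} >= (1-alpha)*b."""
    b = len(y)
    for yi in y:
        below = sum(1 for yj in y if yj <= yi)
        above = sum(1 for yl in y if yl >= yi)
        if below >= alpha * b and above >= (1 - alpha) * b:
            return yi
    raise ValueError("empty vector or alpha outside (0, 1)")


def quartile_interval(xs: List[float], delta: float) -> Tuple[float, float]:
    """[q_{1/4}, q_{3/4}] of the block means, with b = ceil(62*ln(3/delta))."""
    n = len(xs)
    b = math.ceil(62 * math.log(3 / delta))
    if b > n:
        raise ValueError("sample too small for this confidence level")
    blocks = [xs[i * n // b:(i + 1) * n // b] for i in range(b)]
    means = [sum(block) / len(block) for block in blocks]
    return quantile(means, 0.25), quantile(means, 0.75)
```

Unlike the interval of Section 5, no bound on the variance enters the construction: the spread of the block means supplies the scale.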

Proof of Lemma 6.1: Define $J = [\mu - L_0\sigma,\ \mu + L_0\sigma]$. Assume the following three properties hold.

1. $q_{1/4}(Y_1^b) \le \mu$.

2. $q_{3/4}(Y_1^b) \ge \mu$.

3. The number of indices $i \in [b]$ with $Y_i \in J$ is at least $3b/4$.

Then clearly $\mu \in [q_{1/4}(Y_1^b), q_{3/4}(Y_1^b)]$. Moreover, item 3 implies that $q_{1/4}(Y_1^b), q_{3/4}(Y_1^b) \in J$, so that

$$q_{3/4}(Y_1^b) - q_{1/4}(Y_1^b) \le (\text{length of } J) = 2L_0\sigma.$$

It follows that

$$
\begin{aligned}
&P\big(\mu \notin [q_{1/4}(Y_1^b), q_{3/4}(Y_1^b)] \ \text{or}\ q_{3/4}(Y_1^b) - q_{1/4}(Y_1^b) > 2L_0\sigma\big) \\
&\qquad \le P\big(q_{1/4}(Y_1^b) > \mu\big) + P\big(q_{3/4}(Y_1^b) < \mu\big) + P\big(\#\{i \in [b] : Y_i \notin J\} > b/4\big). \qquad (17)
\end{aligned}
$$

We bound the three terms by $e^{-db}$ separately. By assumption, $P(Y_i \le \mu) \ge 1/3$ for each $i \in [b]$. Since these events are also independent, we have that $\sum_{i=1}^{b} \mathbf{1}\{Y_i \le \mu\}$ stochastically dominates a binomial random variable $\mathrm{Bin}(b, 1/3)$. Thus,

$$P\Big(\sum_{i=1}^{b} \mathbf{1}\{Y_i \le \mu\} < b/4\Big) \le P\big(\mathrm{Bin}(b, 1/3) < b/4\big) \le e^{-db}$$

by the relative entropy version of the Chernoff bound and the fact that $d$ is the relative entropy between two Bernoulli distributions with parameters 1/4 and 1/3. A similar reasoning shows that $P(q_{3/4}(Y_1^b) < \mu) \le e^{-db}$ as well.
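The Chernoff step can be compared against exact binomial probabilities. The Python sketch below is illustrative only (the function names are hypothetical): it computes $P(\mathrm{Bin}(b,1/3) < b/4)$ exactly and checks it against $e^{-db}$ with $d$ equal to the relative entropy between Bernoulli(1/4) and Bernoulli(1/3).

```python
import math


def binomial_lower_tail(b: int, p: float, threshold: float) -> float:
    """Exact P(Bin(b, p) < threshold)."""
    return sum(math.comb(b, j) * p ** j * (1 - p) ** (b - j)
               for j in range(b + 1) if j < threshold)


def chernoff_bound(b: int) -> float:
    """exp(-d*b) with d = KL(Bernoulli(1/4) || Bernoulli(1/3))."""
    d = 0.25 * math.log(0.25 / (1 / 3)) + 0.75 * math.log(0.75 / (2 / 3))
    return math.exp(-d * b)


# (exact tail, Chernoff bound) for a few block counts b
checks = [(binomial_lower_tail(b, 1 / 3, b / 4), chernoff_bound(b))
          for b in (4, 20, 100, 400)]
```

Because $d$ is small, the bound decays slowly in $b$, which is the reason for the large factor 62 in the choice of the number of blocks.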

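It remains to bound $P(\#\{i \in [b] : Y_i \notin J\} > b/4)$. To this end note that for all $i \in [b]$,

$$P(Y_i \notin J) = P(|Y_i - \mu| > L_0\sigma) \le \frac{1}{L_0^2}, \qquad (18)$$

and these events are independent. It follows that

$$
\begin{aligned}
P\big(\#\{i \in [b] : Y_i \notin J\} > b/4\big)
&\le P\Big(\bigcup_{A \subset [b],\ |A| = \lceil b/4 \rceil}\ \bigcap_{i \in A} \{Y_i \notin J\}\Big) \\
\text{(union bound)}\qquad
&\le \binom{b}{\lceil b/4 \rceil} \max_{A \subset [b],\ |A| = \lceil b/4 \rceil} P\Big(\bigcap_{i \in A} \{Y_i \notin J\}\Big) \\
\text{(independence of } Y_i \text{ + (18))}\qquad
&\le \binom{b}{\lceil b/4 \rceil} \Big(\frac{1}{L_0^2}\Big)^{\lceil b/4 \rceil} \\
\big(\tbinom{b}{k} \le (eb/k)^k \text{ for all } 1 \le k \le b\big)\qquad
&\le \Big(\frac{e\,b}{L_0^2\,\lceil b/4 \rceil}\Big)^{\lceil b/4 \rceil} \\
\big(b \le 4\lceil b/4 \rceil \text{ and } L_0 = 2e^{(4d+1)/2}\big)\qquad
&\le e^{-4d\lceil b/4 \rceil} \le e^{-bd}.
\end{aligned}
$$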

6.2 Symmetric distributions 

To prove Theorem 3.4, notice that the existence of the multiple-δ sub-Gaussian estimator follows from Theorem 3.3. The second part is a simple consequence of Theorem 4.3 and the fact that Laplace distributions are symmetric around their means.

6.3 Higher moments

In this section we first prove that $\mathcal{P}_{\alpha,\eta} \subset \mathcal{P}_{2,k\text{-reg}}$ for large enough $k$, and then prove Theorem 3.5. We recall the definition of $p_+(P,j)$ and $p_-(P,j)$ from Definition 2.

Lemma 6.2 For all $\alpha \in (2,3]$, there exists $C = C_\alpha$ such that, if $j \ge (C_\alpha \eta)^{2\alpha/(\alpha-2)}$, then $\min(p_+(P,j), p_-(P,j)) \ge 1/3$.


Proof: We only prove that $p_+(P,j) \ge 1/3$, as the other proof is analogous.

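Let $N$ be a standard normal random variable. Take some smooth function $\Psi : \mathbb{R} \to \mathbb{R}$ with bounded second and third derivatives, such that $\Psi(x) = 0$ for $x \in (-\infty, 0]$, $0 \le \Psi(x) \le 1$ for $x \ge 0$ and $E[\Psi(N)] \ge 1/\sqrt{6}$. (It is easy to see that such a $\Psi$ exists.) Also let $X_1^j =_d P^{\otimes j}$ and assume, without loss of generality, that $\sigma_P > 0$. Then

$$p_+(P,j) = P^{\otimes j}\Big(\sum_{i=1}^{j} (X_i - \mu_P) \ge 0\Big) \ge E\Big[\Psi\Big(\frac{\sum_{i=1}^{j}(X_i - \mu_P)}{\sigma_P\sqrt{j}}\Big)\Big].$$

Lindeberg's proof of the central limit theorem (see [16]), specialized to the case where the $X_i$ are i.i.d., gives

$$E\Big[\Psi\Big(\frac{\sum_{i=1}^{j}(X_i - \mu_P)}{\sigma_P\sqrt{j}}\Big)\Big] \ge E[\Psi(N)] - C_0\, j\, P\,\phi\Big(\frac{|X_1 - \mu_P|}{\sigma_P\sqrt{j}}\Big),$$

where $\phi(t) = t^2 \wedge t^3$ and $C_0 > 0$ is a universal constant. Since $E[\Psi(N)] \ge 1/\sqrt{6} > 1/3$ and $\phi(t) \le t^{\alpha}$, we obtain

$$E\Big[\Psi\Big(\frac{\sum_{i=1}^{j}(X_i - \mu_P)}{\sigma_P\sqrt{j}}\Big)\Big] \ge \frac{1}{\sqrt{6}} - C_0\,\eta^{\alpha}\, j^{1-\alpha/2}.$$

The right-hand side is $\ge 1/3$ when $j \ge (C\eta)^{2\alpha/(\alpha-2)}$ for some universal $C = C_\alpha$. □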

Proof of Theorem 3.5: The positive result follows directly from Theorem 3.3 plus Lemma 6.2, which guarantees $p_\pm(P,j) \ge 1/3$ for $j \ge k_\alpha$. For the second part, we first assume $\eta \ge \eta_0$ for a sufficiently large constant $\eta_0$. We use the Poisson family of distributions from Section 4.5. For $\lambda = o(1)$ and $\alpha \in (2,3]$, we have that

$$Po_\lambda|X - \lambda|^{\alpha} = (1+o(1))\,\lambda = (1+o(1))\,\sigma_{Po_\lambda}^{\alpha}\,\lambda^{1-\alpha/2}.$$

If we compare this to Example 3.2, we see that $Po_\lambda \in \mathcal{P}_{\alpha,\eta}$ if $\lambda \ge h\,\eta^{-2\alpha/(\alpha-2)}$ for some constant $h = h_\alpha > 0$ (recall we are assuming that $\eta \ge \eta_0$ is at least a large constant). Now take $c > 0$ such that $c/n = h\,\eta^{-2\alpha/(\alpha-2)}$. If $c \ge c_0$ for the constant $c_0$ in the statement of Theorem 4.4, we can apply the theorem to deduce that there is no multiple-δ estimator for the class $\mathcal{P}_{Po}^{[c/n,\ \phi(L)\,c/n]}$ at the confidence level given by that theorem. Noting that $c$ is of the order $n/k_\alpha$ finishes the proof in this case.

Now assume $\eta < \eta_0$. In this case we use the Laplace distributions in Section 4.4. Since $2 < \alpha \le 3$, we may apply the fact that the central third moment of a Laplace distribution satisfies $La_\lambda|X - \lambda|^3 = 6 \le (3^{1/3}\,2^{-1/6}\,\sigma_{La_\lambda})^3$ to obtain

$$La_\lambda|X - \lambda|^{\alpha} \le \big(La_\lambda|X - \lambda|^3\big)^{\alpha/3} \le \big(3^{1/3}\,2^{-1/6}\,\sigma_{La_\lambda}\big)^{\alpha}.$$

Our assumption on $\eta$ implies that $\mathcal{P}_{La} \subset \mathcal{P}_{\alpha,\eta}$. Thus Theorem 4.3 implies that there is no δ-dependent or multiple-δ sub-Gaussian estimator for $(\mathcal{P}_{\alpha,\eta}, n, e^{1-5L^2 n})$. This is the desired result since $k_\alpha$ is bounded when $\eta \le \eta_0$.

Finally, the third part of the theorem follows from the same reasoning as in the previous paragraph.

7 Bounded kurtosis and nearly optimal constants 

In this section we prove Theorem 3.6. Throughout the proof we assume $X =_d P$ and $X_1^n =_d P^{\otimes n}$ for some $P \in \mathcal{P}_{krt \le \kappa}$, and let $b_{\max}$, $C$ and the other constants be as in Section 3.4. Our proof is divided into four steps.

1. Preliminary estimates for mean and variance. We use the median-of-means technology to obtain preliminary estimates for the mean and variance of $P$. These estimates are not good enough to satisfy the claimed properties, but with extremely high probability they are reasonably close to the true values.

2. Truncation at the ideal point. We introduce a two-parameter family of truncation-based estimators for $\mu_P$, and analyze the behavior of one such estimator, chosen under knowledge of $\mu_P$ and $\sigma_P$.

3. Truncated estimators are insensitive. We then use a chaining argument to show that this two-parameter family is insensitive to the choice of parameters.

4. Wrap up. The insensitivity property means that the preliminary estimates from Step 1 are good enough to "make everything work."

We conclude the section by a remark on how to obtain a broader range of $\delta_{\min}$ with a worse constant $L$.

Step 1. (Preliminary estimates via median of means.) Denote by $\hat\mu_{b_{\max}} = \hat\mu_{b_{\max}}(X_1^n)$ the estimator given by Theorem 4.1 with $\delta = e^{-b_{\max}}$, which is possible if $C \ge 6/(1 - \log 2)$.

The next lemma provides an estimator of the variance. 


Lemma 7.1 Let Bi,..., denote a partition of [n] into blocks of size \Bi\ > k 

[n/6maxj > 2. For each block Bi with i G [6 ma x] . define 


a? = 


BA BA - 


g.| _ i^j - Xkf and = qi/ 2 {di, • • •, ?w) 


j^keBi 


Then 


^ ^I^Lax - ^p| ^ 2ev^6(K + 


> 1 - e“ 


In particular, if 


96e(K + 3) b„ 


n 


< 1 , 


then 


IP ^l^fcmax - I^pI < 2V2ep6^^,y^^ 


3 

and T? < -a% 1 > 1 — 2e“ 


n - 2 P ' - 


Proof: Compute 


E [^t] = 


+ 


\BAH\BA-iy 

6 

BA^{\BA-ir 


Y E[(X,--X,)^] 

{hk)&Bf'> 

Y, E[(X,--Xfc)2(X,--X/ 


+ 


|B.P(|B.| - 1)^ E 


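Expanding all the squares, using independence and noticing that $E[X_j - \mu_P] = 0$, we get, for distinct indices $i, j, k, l$,

$$E[(X_j - X_k)^4] = 2(\kappa_P + 3)\,\sigma_P^4,$$

$$E[(X_j - X_k)^2 (X_j - X_l)^2] = (\kappa_P + 3)\,\sigma_P^4, \qquad E[(X_j - X_k)^2 (X_i - X_l)^2] = 4\,\sigma_P^4.$$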
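These three moment identities hold for any distribution with finite fourth moment, so they can be verified exactly on a small discrete example. The Python sketch below uses rational arithmetic on an arbitrary four-point distribution; the distribution and the helper name `expect` are chosen purely for illustration.

```python
from fractions import Fraction
from itertools import product

# An arbitrary four-point distribution with exact rational weights.
values = [Fraction(-2), Fraction(0), Fraction(1), Fraction(3)]
probs = [Fraction(1, 4)] * 4


def expect(f, k: int) -> Fraction:
    """E[f(X_1, ..., X_k)] for i.i.d. copies, by exact enumeration."""
    total = Fraction(0)
    for idx in product(range(len(values)), repeat=k):
        weight = Fraction(1)
        for i in idx:
            weight *= probs[i]
        total += weight * f(*[values[i] for i in idx])
    return total


mean = expect(lambda x: x, 1)
var = expect(lambda x: (x - mean) ** 2, 1)
kurt = expect(lambda x: (x - mean) ** 4, 1) / var ** 2

m_pair = expect(lambda a, b: (a - b) ** 4, 2)
m_shared = expect(lambda a, b, c: (a - b) ** 2 * (a - c) ** 2, 3)
m_disjoint = expect(lambda a, b, c, d: (a - b) ** 2 * (c - d) ** 2, 4)
```

The example distribution has non-zero mean on purpose: all three quantities depend only on differences of the observations, so they are invariant under translation.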


Therefore, 



3(kp + 3) 

\BA 



4<e[4]' + 6(k + 3)4^ 
Lemma 4.1 with $L_0 = e$ then gives


1^2 2 

Wb “ 

I L/max ^ 


> 2eV6(« + 3 ) 4 y^ 


< e 


-bn 


In particular, we get 


>l-e 


The theorem follows by the definition of and an application of Theorem 14.11 □ 


Step 2: (Two-parameter family of estimators at the ideal point.) Given $\mu$ and $R$ define, for all $x \in \mathbb{R}$,

$$\Psi_{\mu,R}(x) = \mu + \Big(1 \wedge \frac{R}{|x - \mu|}\Big)(x - \mu).$$
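The truncation map is simply a clipping of $x$ to the interval $[\mu - R, \mu + R]$. The Python sketch below illustrates this under the reading $\Psi_{\mu,R}(x) = \mu + (1 \wedge R/|x-\mu|)(x-\mu)$; the function names are hypothetical, and `truncated_mean` corresponds to the empirical average of the truncated observations.

```python
from typing import List


def truncate(x: float, mu: float, R: float) -> float:
    """Psi_{mu,R}(x): leave x unchanged if |x - mu| <= R, otherwise move it
    to the nearest endpoint of [mu - R, mu + R]."""
    deviation = x - mu
    if abs(deviation) <= R:
        return x
    return mu + R if deviation > 0 else mu - R


def truncated_mean(xs: List[float], mu: float, R: float) -> float:
    """Empirical mean of the truncated sample."""
    return sum(truncate(x, mu, R) for x in xs) / len(xs)
```

Writing the map as a clip avoids the division by $|x - \mu|$, which is undefined at $x = \mu$ in the formula but harmless there since the factor multiplies zero.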


Lemma 7.2 Assume $b_{\max} \ge t$, $R = \sigma_P\sqrt{n/(2b_{\max})}$ and $\mu = \mu_P$. Then, with probability at least $1 - 2e^{-t}$,



Proof: The proof is a consequence of Bennett's inequality. It suffices to estimate the moments of $\Psi_{\mu,R}(X) - \mu_P$. For the first moment,


|]E[T^,ij(X)-HI = 


E 


< E 


1 A 

1 - 


R 


\X — /ip 
R 


-Ij (X-/ip) 
\X — /ip 


\X — p,p 

<E[|X-/ip| l{|X-/ip| > R}] 


< E 


\X — fip\ 


1/4 _ 


|X-„|>RA-‘<^g£ 


where we used Holder’s inequality. On the other hand. 


E 


{^ii,RiX) - HpY 


< fjp 


By the Cauchy-Schwarz inequality, and using the bounded kurtosis assumption, 

R 


E 


- Iip| 


= E 


1 A 


\X - /ip| 


3 



Ix-Zipl^" 

< E 

|X-/ip|3 
Finally, for any $p \ge 4$, since $|\Psi_{\mu,R}(X) - \mu_P| \le R$,

$$E\big[|\Psi_{\mu,R}(X) - \mu_P|^p\big] \le R^{p-4}\,\kappa_P\,\sigma_P^4.$$

For $s = \sqrt{2nt}/\sigma_P$, we have $|sR|/n \le 1$ and therefore


E 


e ^ 




^ ^ S Kpcjp 2 ,- O 4 / ^ 4! 

- ^ ^^ ^ 5 ^ 


n R^ 2ii? 6n^ 
3/2 


< exp 2\/2 — KpCTp 

V n \ n 


h \ ^ e2 „3 

\ O o C> 


5s^ 


By Chernoff’s bound, 


Pn'h/t,i? — PP > 2\/2Kp(Tp 


or, equivalently. 


n 


3/2 


2n2"P + 




Pn'I'/2,_R — /^P > 2\/2, 


KpCTp 


n 


3/2 


+ i + J_./n;i + lM <e-< 

n I 3-v/2 V n 48 n M 


Repeat the same computations with $s = -\sqrt{2nt}/\sigma_P$ to prove the lower bound. □


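Step 3: (Insensitivity of the estimators.) Given $\epsilon_\mu, \epsilon_R \in (0, 1/2)$, define

$$\mathcal{R} = \Big\{(\mu, R) : |\mu - \mu_P| \le \epsilon_\mu\sigma_P,\ \ \big|R - \sigma_P\sqrt{n/(2b_{\max})}\big| \le \epsilon_R\sigma_P\Big\},$$

$$\Delta_{\mu,R} = P_n\Psi_{\mu,R} - P_n\Psi_{\mu_P,\ \sigma_P\sqrt{n/(2b_{\max})}}.$$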


Lemma 7.3 Assume $\sqrt{n/(2b_{\max})} \ge 2(\epsilon_\mu + \epsilon_R)$. Then for any $t \ge 0$, with probability at least $1 - e^{-t}$, for all $(\mu, R) \in \mathcal{R}$,


I » \ ^ ! \ ( ^66max 4\/6maxt 2t \ 

I A„.h| < (e„ + + - 1 
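Proof: Start with the trivial bound

$$|\Psi_{\mu,R}(x) - \Psi_{\mu',R'}(x)| \le |\mu - \mu'| + |R - R'|$$

that holds for all $(\mu, R), (\mu', R') \in \mathcal{R}$ and $x \in \mathbb{R}$. Moreover, assume that $|x - \mu_P| \le \sigma_P\sqrt{n/(8b_{\max})}$. Then

$$|x - \mu| \le |x - \mu_P| + \epsilon_\mu\sigma_P \le \sigma_P\Big(2\sqrt{n/(8b_{\max})} - \epsilon_R\Big) \le R.$$

Hence, for any $x \in \mathbb{R}$ such that $|x - \mu_P| \le \sigma_P\sqrt{n/(8b_{\max})}$ and for all $(\mu, R) \in \mathcal{R}$,

$$\Psi_{\mu,R}(x) = x.$$

Therefore, for any $(\mu, R)$ and $(\mu', R')$ in $\mathcal{R}$ and for any $x \in \mathbb{R}$,

$$|\Psi_{\mu,R}(x) - \Psi_{\mu',R'}(x)| \le \big(|\mu - \mu'| + |R - R'|\big)\,\mathbf{1}\big\{|x - \mu_P| > \sigma_P\sqrt{n/(8b_{\max})}\big\}.$$

By Chebyshev's inequality, this implies that, for any positive integer $p$,

$$P\,|\Psi_{\mu,R} - \Psi_{\mu',R'}|^p \le \big(|\mu - \mu'| + |R - R'|\big)^p\,\frac{8\,b_{\max}}{n}.$$

By Bennett's inequality,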


max Ay/b 

max^ ^ 26~^ 

yl/r — ^'1 + |i? — i?'| n n 3nJ ~ 


To apply a chaining argument, consider the sequence $(D_j)_{j \ge 0}$ of subsets of $\mathcal{R}$ obtained by the following construction. $D_0 = \{(\mu_P, \sigma_P\sqrt{n/(2b_{\max})})\}$ and, for any $j \ge 1$, divide $\mathcal{R}$ into $4^j$ pieces by dividing each axis into $2^j$ pieces of equal sizes. Define then $D_j$ as the set of lower left corners of the $4^j$ rectangles. Then $|D_j| = 4^j$ and, for any $(\mu, R) \in \mathcal{R}$, there exists a point $\pi_j(\mu, R) \in D_j$ such that the $\ell_1$-distance between $(\mu, R)$ and $\pi_j(\mu, R)$ is upper-bounded by $2^{-j}(\epsilon_\mu + \epsilon_R)\,\sigma_P$. Therefore,


sup 




< 


^ sup 

j>l 


A 




-A 




A union bound in Bennett's inequality gives that, with probability at least $1 - 2^{-j}e^{-t}$, for any $(\mu, R) \in D_j$,


>^li,R 


-A 


■nj-i{^,R) 


< (c/, + eij)dp 


166 n 


2-?n 


8V6max(t + j'logS) ^2t + 2j log 8 


2-?n 


3n2-? 


Summing up these inequalities gives the desired bound. □

Corollary 7.1 Assume $t \ge 1$, $\sqrt{n/(2b_{\max})} \ge 2(\epsilon_\mu + \epsilon_R)$. Then, with probability at least

1 - 2e“* - for all {fi, R) G TZ, 


Pn^^J.,R — MP 


< ^ (V2i(l+6)+6 

V JT' 


where 


^ 1 /Kpt 5 Kpi 

6 = :^^^/-^ + 7^^ + 2(e^ + e«) 


3-\/2 V n 48 n 


2^max 1 / ^ 


n 3 V ’ 


,3/2 , 

6 = 2V2kp^ + 56(e^ + en) — 
n n 


Step 4: (Wrap-up.) Define now 

ihn^Rn) = (^hbrnaxi Rbm 
From Lemma 7.1, with probability at least $1 - 2e^{-b_{\max}}$,


n 


2b^ 


/ ^max 

1^2 2 1 

V n ’ 

\l/u — Op 

1 ^max ^ \ 


< 2eV60^T^4t/^ 

V n 


The second inequality gives 


'l-2e\/6(Kp + 3)W^225L < < W 1 + 2eV6 (kp + 3) 


n 


crp 


n 


Since we can assume that 


2eV6(^P + 3)A/^ 
V n 


< 1 , 


we deduce that 


T/, — Up 

I Dmax ^ 


<e^l2{Kp + 3)ap\[^^ 
V n 


This means that, with probability at least $1 - 2e^{-b_{\max}}$, $(\hat\mu_n, \hat R_n)$ belongs to $\mathcal{R}$ if we define


= 2v^f 


n 


^ \/^5 = 2e-^3(Kp -|- 3) < 19-y/^ . 


By an appropriate choice of the constant $C$, we can always assume that $\sqrt{n/(2b_{\max})}$ is at least some large constant, to ensure that $2(\epsilon_\mu + \epsilon_R) \le \sqrt{n/(2b_{\max})}$. So Corollary 7.1 applies and gives


P 


Pn'hg R h'P 

fTni^n 


< 


up 


2t(l + 6) + ^2 ) ) > 1 - 2e-* - 4e-^" 
where


6 = 36 


Kp 6n 


n 


6 = 2V2, 


^3/2 


/^P - 


n 


+ 1120'y/Kp- 


n 


46 ^ — 6n 


In particular, if <5 > 


we get 


P 




< 


(Tp 


2(l + ln(l/5))(l + 6)+6 


> 1 - - + 


e 4e/(e —2) 


5=1-5. 


Remark 2 Let us quickly sketch how one may get a smaller value of $\delta_{\min}$ at the expense of a larger constant $L$. The idea is to redo the proof of part 1 of Theorem 3.2 (cf. Section 5). We build δ-dependent estimators for $\mu_P$ via median-of-means, as in that proof, but then use the value $2\hat\sigma_b(X_1^n)$ from Lemma 7.1 instead of the value $\sigma_2$ when building the confidence interval, with a choice of $b \approx \ln(1/\delta)$. Then one obtains an empirical confidence interval that contains $\mu_P$ and has the appropriate length with probability $\ge 1 - 2\delta$ whenever $\ln(1/\delta) \le cn/\kappa$ for some constant $c > 0$. Using Theorem 4.2 as in Section 5 then gives a multiple-δ $L$-sub-Gaussian estimator for $\mathcal{P}_{krt \le \kappa}$ over the corresponding range of confidence levels, for large enough values of $n/\kappa$, where $L$ does not depend on $n$ or $\kappa$. It is an open question whether one can obtain a similar value of $\delta_{\min}$ with $L = \sqrt{2} + o(1)$.


8 Open problems 

We conclude the paper by a partial list of problems related to our results that seem especially 
interesting. 

Sharper constants and truly sub-Gaussian estimators. For what families $\mathcal{P}$ of distributions and what values of $\delta_{\min}$ can one find multiple-δ estimators with sharp constant $L = \sqrt{2} + o(1)$? One may even sharpen our definition of a sub-Gaussian estimator and ask for estimators that satisfy

$$P\Big(|E_n(X_1^n) - \mu_P| > \frac{\sigma_P\,\Phi^{-1}(1 - \delta/2)}{\sqrt{n}}\Big) \le (1 + o(1))\,\delta$$

for all $P \in \mathcal{P}$ and $\delta \in [\delta_{\min}, 1)$.

Sub-Gaussian confidence intervals. The notion of sub-Gaussian confidence interval introduced in Section 4.2 seems interesting in its own right. For which classes of distributions $\mathcal{P}$ can one find sub-Gaussian confidence intervals? Can one reverse the implication in Theorem 4.2 and build sub-Gaussian confidence intervals from multiple-δ estimators?

Empirical risk minimization. Suppose now that the $X_i$ are i.i.d. random variables that live in an arbitrary measurable space and have common distribution $P$. In a prototypical risk minimization problem, one wishes to find an approximate minimum of a functional $\ell(\theta) := P f(\theta, X)$ over choices of $\theta \in \Theta$. The usual way to do this is via empirical risk minimization, which consists of minimizing the empirical risk $\hat\ell_n(\theta) := P_n f(\theta, X)$ instead. Under strong assumptions on the family $F := \{f(\theta, \cdot)\}_{\theta \in \Theta}$ (such as uniform boundedness), the fluctuations of the empirical process $\{(P_n - P) f(\theta, X)\}_{\theta \in \Theta}$ can be bounded in terms of geometric or combinatorial properties of $F$, and this leads to results on empirical risk minimization. However, the strong sub-Gaussian concentration results one may obtain are only available when $F$ has very light tails.

A natural way to obtain strong sub-Gaussian concentration for heavier-tailed $F$ would be to replace the usual empirical estimates $P_n f(\theta, X)$ by one of our multiple-δ sub-Gaussian estimates. This, however, is not straightforward. The usual chaining techniques for controlling empirical processes rely on linearity, and our estimators are nonlinear in the sample. Although there are (artificial) ways around this, we do not know of any efficient method for doing the analogue of empirical risk minimization with our estimators in any nontrivial setting. These difficulties were overcome by Brownlees et al. [3] via Catoni's multiple-δ subexponential estimator, at the cost of obtaining weaker concentration. Can one do something similar and achieve truly sub-Gaussian results at low computational cost?